Offline Internet Archive

Introduction to the Offline Internet Archive project

The internet now seems like a utility, available everywhere from our homes and offices to trains and planes. But utility-level access is not yet a reality for more than half of the world’s population who lack consistent, or indeed any access, to the Internet.

Why?

Cost: Internet access is unaffordable to people with low or no income.
Connectivity: In many developing countries and rural areas the infrastructure that enables internet access is unreliable, slow, or simply unavailable. Natural disasters, uprisings, and war compound the challenge.
Censorship: Some countries limit internet access for political reasons. Several countries block the Internet Archive. In some countries, Facebook has become synonymous with the internet – but it is hardly a substitute for free and open World Wide Web.

The Internet Archive offers perhaps the world’s largest online store of open content. The wisdom of the ages, just a few clicks away. As Wikipedia has become the world’s encyclopedia, the Internet Archive has become its library. Central to our mission is establishing “Universal Access to All Knowledge”. Access to our library of millions of books, journals, audio and video recordings and beyond is free to anyone — with one caveat — the need for a reliable internet connection.

Lack of access to today’s internet is a significant factor in poorer educational outcomes, inter-generational poverty and disempowerment as identified by the UN in their Sustainable Development Goal #9. The Offline Archive project works towards making online collections available — regardless of internet availability.

Part of the challenge is that those of us who live where the Internet works well, are adding graphics, video and other demands on bandwidth faster than access is being improved in many parts of the world.

An evolving ecosystem is emerging to enable access over poorer internet. Typically the approaches build around low cost, low power, devices that can be installed, in communities and schools for example, and deliver content either offline or through better usage of a narrow pipe to the net.

We have built an offline server that:

Crawls Internet Archive collections to a local server,
Serves that content locally,
Caches content while browsing,
Moves content between servers by sneakernet — on disks, USB sticks, and SD cards,
Delivers (mostly) the Internet Archive UI offline in javascript in the browser,
Is open source,
And is being made available in other languages.

The server is integrated into the Internet-In-A-Box (IIAB) platform, and can be installed on top of the Rachel platform, or hopefully any linux based platform. Our approach should improve access for anything from a US$20 Raspberry Pi up to a server holding terabytes of data for an institution. We are also collaborating with other parts of the ecosystem, integrating the Archive’s APIs with those of other partners, to make it easier for them to incorporate Archive content.

Contributing

We'd love to have you contribute, please email [email protected], or interact with the rest of this repo, and I'll figure out how to help you get started. (TODO setup a better channel for this !)

Installation

If you would like to run the offline archive server then see INSTALLATION.md, and the documents it points to.

If you want to fix bugs, develop code or contribute in other ways then see INSTALLATION-dev.md. (Note this document was written for Mac OSX users, a useful task would be for someone with a Linux machine to make any edits to it if required, or just confirms it is correct.)

Also see these documents to update an existing installation, Or to troubleshoot an existing installation.

Using it - starting the server.

See the Installation docs, but on most platforms (except, currently, on Mac OSX) the server should start at reboot.

If not, then assuming you've got it installed in your home directory ...

cd ~/node_modules/dweb-mirror && ./internetarchive --server &

Or a slightly different location for the developers.

The startup is a little slow but you'll see some debugging when its live

On platforms where it starts automatically (e.g. IIAB, Rachel), it can be turned on or off at a terminal window with service internetarchive start or service internetarchive stop

Browsing

Open the web page - the address depends on the platform.

http://archive.local:4244 or http://archive:4244 should work on any platform, but this depends on the configuration of your LAN.
If you know the IP address then http::4244 will work
On MacOSX (or if using a browser on the RaspberryPi/OrangePi): http://localhost:4244
On Rachel try http://rachel.local:4244 or http://rachel:4244 or via the main interface at http://rachel.local and click Internet Archive
On IIAB The server can be accessed at http://box:4244 or http://box.lan:4244 (try http://box.local:4244 via mDNS over a local network, if you don't have name resolution set up to reach your Internet-in-a-Box).

Try walking through ./USING.md to get a tour of the system, and you can click Home or the Internet Archive logo, if you just want to explore the Internet Archive's resources.

Administration

Administration is carried out mostly through the same User Interface as browsing.

Select local from any of the pages to access a display of local content. Administration tools are under Settings.

Click on the Archive logo, in the center-top, to get the Internet Archive main interface if connected to the net.

While viewing an item or collection, the Crawl button in the top bar indicates whether the item is being crawled or not. Clicking it will cycle through three levels:

No crawling
Details - sufficient information will be crawled to display the page, for a collection this also means getting the thumbnails and metadata for the top items.
Full - crawls everything on the item, this can be a LOT of data, including full size videos etc, so use with care if bandwidth/disk is limited.

Disk storage

The server checks for caches of content in directories called archiveorg in all the likely places, in particular it looks for any inserted USB drives on most systems, and if none are found, it uses ~/archiveorg.

The list of places it checks, in an unmodified installation can be seen at https://github.com/internetarchive/dweb-mirror/blob/master/configDefaults.yaml#L7.

You can override this in dweb-mirror.config.yaml in the home directory of the user that runs the server. (Note on IIAB this is currently in /root/dweb-mirror.config.yaml) (see 'Advanced' below)

Archive's Items are stored in subdirectories of the first of these directories found, but are read from any of the locations.

If your disk space is getting full, its perfectly safe to delete any subdirectories (except archiveorg/.hashstore), and the server will refetch anything else it needs next time you browse to the item while connected to the internet.

It is also safe to move directories to an attached USB (underneath a archiveorg directory at the top level of the disk) It is also safe to move attached USB's from one device to another.

Some of this functionality for handling disks is still under active development, but most of it works now.

Maintenance

If you are worried about corruption, or after for example hand-editing or moving cached items around.

Run everything as root

sudo su

cd into location for your installation

cd ~/node_modules/@internetarchive/dweb-mirror
./internetarchive -m

This will usually take about 5-10 minutes depending on the amount of material cached, just to rebuild a table of checksums.

Advanced

Most functionality of the tool is controlled by two YAML files, the second of which you can edit if you have access to the shell.

You can view the current configuration by going to /info on your server. The default, and user configurations are displayed as the 0 and 1 item in the /info call.

In the Repo is a default YAML file which is commented. You really should never need to edit this file, as anything in it can be overridden by lines in ~/dweb-mirror.config.yaml. Make sure you understand how yaml works before editing this file, if you break it, you can copy a new default from dweb-mirror.config.yaml on the repo

Note that this file is also edited automatically when the Crawl button described above is clicked.

As the project develops, this file will be more and more editable via a UI.

Crawling

The Crawler runs automatically at startup and when you add something to the crawl, but it can also be configurable through the YAML file described above or run at a command line for access to more functionality.

In a shell

sudo sh

cd into the location for your installation, on most platforms it is

cd ~/node_modules/@internetarchive/dweb-mirror

Or on IIAB it would be

cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror

Perform a standard crawl

./internetarchive --crawl

To fetch the "foobar" item from IA

./internetarchive --crawl foobar

To crawl top 10 items in the prelinger collection sufficiently to display and put them on a disk plugged into the /media/pi/xyz

./internetarchive --copydirectory /media/pi/xyz/archiveorg --crawl --rows 10 --level details prelinger

To get a full list of possible arguments and some more examples

./internetarchive --help

More info

I recommend following through the tour in USING.md

Dweb-Mirror lives on GitHub at:

dweb-mirror (the server) source, and issues tracker
dweb-archive (the UI) source, and issues tracker

This project is part of the Internet Archive's larger Dweb project, see also:

dweb-universal info about others working to bring access offline.
dweb-transports for our transport library to IPFS, WEBTORRENT, WOLK, GUN etc
dweb-archivecontroller for an object oriented wrapper around our APIs

internetarchive/dweb-mirror

internetarchive

Reviews

Repository Details