
Distributed Wikipedia Mirror Project

Putting Wikipedia Snapshots on IPFS and working towards making it fully read-write.

Existing Mirrors

There are various ways to access the mirrors: through a DNSLink, a public gateway, or directly with a CID.

You can read all about the available methods here.
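
For example, using the Turkish mirror referenced later in this README (a minimal sketch; the public gateway shown is just one illustrative choice):

$ ipfs resolve -r /ipns/tr.wikipedia-on-ipfs.org   # DNSLink -> CID of the current snapshot
$ ipfs ls /ipns/tr.wikipedia-on-ipfs.org | head    # browse the snapshot root over IPFS
# or, without a local node, via a public gateway:
#   https://dweb.link/ipns/tr.wikipedia-on-ipfs.org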

DNSLinks

Mirrors are published under DNSLink names of the form <language code>.wikipedia-on-ipfs.org (for example, tr.wikipedia-on-ipfs.org, which is used in the examples below).

CIDs

The latest CIDs that the DNSLinks point at can be found in snapshot-hashes.yml.


Each mirror has a link to the original Kiwix ZIM archive in the footer. It can be downloaded and opened offline with the Kiwix Reader.


Purpose

“We believe that information—knowledge—makes the world better. That when we ask questions, get the facts, and are able to understand all perspectives on an issue, it allows us to build the foundation for a more just and tolerant society” -- Katherine Maher, Executive Director of the Wikimedia Foundation

Wikipedia on IPFS -- Background

What does it mean to put Wikipedia on IPFS?

The idea of putting Wikipedia on IPFS has been around for a while. Every few months or so someone revives the threads. You can find such discussions in this github issue about archiving wikipedia, this issue about possible integrations with Wikipedia, and this proposal for a new project.

We have two consecutive goals regarding Wikipedia on IPFS: Our first goal is to create periodic read-only snapshots of Wikipedia. A second goal will be to create a full-fledged read-write version of Wikipedia. This second goal would connect with the Wikimedia Foundation’s bigger, longer-running conversation about decentralizing Wikipedia, which you can read about at https://strategy.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

(Goal 1) Read-Only Wikipedia on IPFS

The easy way to get Wikipedia content on IPFS is to periodically -- say every week -- take snapshots of all the content and add it to IPFS. That way the majority of Wikipedia users -- who only read Wikipedia and don’t edit -- could use all the information on Wikipedia with all the benefits of IPFS. Users couldn't edit it, but they could download and archive swaths of articles, or even the whole thing. People could serve it to each other peer-to-peer, reducing the bandwidth load on Wikipedia servers. People could even distribute it to each other in closed, censored, or resource-constrained networks -- with IPFS, peers do not need to be connected to the original source of the content; being connected to anyone who has the content is enough. Effectively, the content can jump from computer to computer in a peer-to-peer way and avoid having to connect to the content source or even the internet backbone. We've been in discussions with many groups about the potential of this kind of distribution, and how it could help billions of people around the world access information better -- either free of censorship, or circumventing serious bandwidth or latency constraints.

So far, we have achieved part of this goal: we have static snapshots of all of Wikipedia on IPFS. This is already a huge result that will help people access, keep, archive, cite, and distribute lots of content. In particular, we hope that this distribution helps people in Turkey, who find themselves in a tough situation. We are still working out a process to continue updating these snapshots, and we hope to have someone at Wikimedia in the loop, as they are the authoritative source of the content. If you could help with this, please get in touch with us at [email protected].

(Goal 2) Fully Read-Write Wikipedia on IPFS

The long-term goal is to get the full-fledged read-write Wikipedia to work on top of IPFS. This is much more difficult because, for a read-write application like Wikipedia to leverage the distributed nature of IPFS, we need to change how the application writes data. A read-write Wikipedia on IPFS could be completely decentralized and extremely difficult to censor. In addition to all the benefits of the static version above, users of a read-write Wikipedia on IPFS could write content from anywhere and publish it, even without being directly connected to any wikipedia.org servers. There would be automatic version control and version history archiving. We could allow people to view, edit, and publish in completely encrypted contexts, which is important to people in highly repressive regions of the world.

A full read-write version (2) would require strong collaboration with Wikipedia.org itself, as well as finishing work on important dynamic content challenges -- we are working on all the technology (2) needs, but it's not ready for prime time yet. We will post an update when it is.

How to add new Wikipedia snapshots to IPFS

The process can be almost fully automated; however, it consists of many stages, and understanding what happens during each stage is essential when the ZIM format changes and our build toolchain needs to be debugged and updated.

  • Manual builds are useful in debugging situations, when a specific stage needs to be executed multiple times to fix a bug.
    • mirrorzim.sh automates some steps for QA purposes and ad-hoc experimentation

Note: This is a work in progress. We intend to make it easy for anyone to create their own wikipedia snapshots and add them to IPFS, making sure those builds are deterministic and auditable, but our first emphasis has been to get the initial snapshots onto the network. This means some of the steps aren't as easy as we want them to be. If you run into trouble, seek help by opening a github issue, commenting in the #ipfs channel on IRC, or posting a thread on https://discuss.ipfs.io.

Manual build

If you would like to create an updated Wikipedia snapshot on IPFS, you can follow these steps.

Step 0: Clone this repository

All commands are assumed to be run inside a cloned copy of this repository.

Clone the distributed-wikipedia-mirror git repository

$ git clone https://github.com/ipfs/distributed-wikipedia-mirror.git

then cd into that directory

$ cd distributed-wikipedia-mirror

Step 1: Install dependencies

Node and yarn are required. On Mac OS X you will need sha256sum, available in coreutils.

Install the node dependencies:

$ yarn

Then, download the latest zim-tools and add zimdump to your PATH. This tool is necessary for unpacking ZIM files.
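
For example, on Linux (a sketch only -- the release version and archive name below are assumptions; check https://download.openzim.org/release/zim-tools/ for the current one):

$ wget https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.1.0.tar.gz
$ tar xzf zim-tools_linux-x86_64-3.1.0.tar.gz
$ export PATH="$PWD/zim-tools_linux-x86_64-3.1.0:$PATH"
$ command -v zimdump   # confirm zimdump is now on PATH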

Step 2: Configure your IPFS Node

It is advised to use a separate IPFS node for this:

$ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR
$ ipfs init -p server,local-discovery,flatfs,randomports --empty-repo
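
Because IPFS_PATH is only set in the current shell, it is easy to accidentally operate on your default repo later. A quick sanity check (a small sketch, nothing project-specific):

$ echo "$IPFS_PATH"
$ ipfs config Datastore.Spec.mounts | jq -r '.[0].child.type'   # expect: flatfs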

Tune DHT for speed

Wikipedia has a lot of blocks. To publish them as fast as possible, enable the Accelerated DHT Client:

$ ipfs config --json Experimental.AcceleratedDHTClient true

Tune datastore for speed

Make sure the repo uses flatfs with sync set to false:

$ ipfs config --json 'Datastore.Spec.mounts' "$(ipfs config 'Datastore.Spec.mounts' | jq -c '.[0].child.sync=false')"
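
You can confirm the change took effect (a small sketch):

$ ipfs config Datastore.Spec.mounts | jq '.[0].child.sync'   # expect: false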

NOTE: While the badgerv1 datastore is faster in some configurations, we chose to avoid using it with bigger builds like English because of memory issues due to the number of files. A potential workaround is to use filestore, which avoids duplicating data and reuses unpacked files as-is.

HAMT sharding

Make sure you use go-ipfs 0.12 or later; it has automatic sharding of big directories.

Step 3: Download the latest snapshot from kiwix.org

The source of ZIM files is at https://download.kiwix.org/zim/wikipedia/. Make sure you download the _all_maxi_ snapshots, as those include images.

To automate this, you can also use the getzim.sh script:

First, download the latest wiki lists using bash ./tools/getzim.sh cache_update

After that, create a download command using bash ./tools/getzim.sh choose; it should output an executable command, e.g.

Download command:
    $ ./tools/getzim.sh download wikipedia wikipedia tr all maxi latest

Running the command will download the chosen ZIM file to the ./snapshots directory.

Step 4: Unpack the ZIM snapshot

Unpack the ZIM snapshot using zimdump:

$ zimdump dump ./snapshots/wikipedia_tr_all_maxi_2021-01.zim --dir ./tmp/wikipedia_tr_all_maxi_2021-01

ℹ️ ZIM's main page

Each ZIM file has a "main page" attribute which defines the landing page set for the ZIM archive. It is often different from the "main page" of the upstream Wikipedia. The Kiwix main page needs to be passed in the next step, so until there is an automated way to determine the "main page" of a ZIM, you need to open the ZIM in a Kiwix reader and note the name of the landing page.

Step 5: Convert the unpacked zim directory to a website with mirror info

IMPORTANT: The snapshots must say who disseminated them. This effort to mirror Wikipedia snapshots is not affiliated with the Wikimedia Foundation and is not connected to the volunteers whose contributions are contained in the snapshots. The snapshots must include information explaining that they were created and disseminated by independent parties, not by Wikipedia.

The conversion to a working website and the appending of necessary information is done by the node program under ./bin/run.

$ node ./bin/run --help

The program requires the main page for the ZIM and for the online version as inputs. For instance, the ZIM file for Turkish Wikipedia has a main page of Kullanıcı:The_other_Kiwix_guy/Landing, but https://tr.wikipedia.org uses Anasayfa as the main page. Both must be passed to the node script.

To determine the original main page use ./tools/find_main_page_name.sh:

$ ./tools/find_main_page_name.sh tr.wikiquote.org
Anasayfa

To determine the main page in the ZIM file, open it in a Kiwix reader or use zimdump info (version 3.0.0 or later) and ignore the A/ prefix:

$ zimdump info wikipedia_tr_all_maxi_2021-01.zim
count-entries: 1088190
uuid: 840fc82f-8f14-e11e-c185-6112dba6782e
cluster count: 5288
checksum: 50113b4f4ef5ddb62596d361e0707f79
main page: A/Kullanıcı:The_other_Kiwix_guy/Landing
favicon: -/favicon

$ zimdump info wikipedia_tr_all_maxi_2021-01.zim | grep -oP 'main page: A/\K\S+'
Kullanıcı:The_other_Kiwix_guy/Landing

The conversion is done on the unpacked zim directory:

node ./bin/run ./tmp/wikipedia_tr_all_maxi_2021-02 \
  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \
  --zimfile=./snapshots/wikipedia_tr_all_maxi_2021-02.zim \
  --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \
  --mainpage=Anasayfa

Step 6: Import website directory to IPFS

Increase the limitation of opening files

In some cases, you will encounter an error like could not create socket: Too many open files when you add files to the IPFS store. It happens when IPFS needs to open more files than the operating system allows. You can temporarily raise this limit to avoid the error using this command.

ulimit -n 65536

Add immutable copy

Add all the data to your node using ipfs add. Use the following command, replacing $unpacked_wiki with the path to the website you created in the previous step (e.g. ./tmp/wikipedia_en_all_maxi_2018-10).

$ ipfs add -r --cid-version 1 --offline $unpacked_wiki

Save the last hash of the output from the above process. It is the CID of the website.
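
If you prefer to capture the CID programmatically instead of scrolling back through the output, ipfs add has a quieter mode that prints only the final hash (a sketch; ROOT_CID is just a placeholder variable name):

$ ROOT_CID=$(ipfs add -r --cid-version 1 --offline -Q $unpacked_wiki)
$ echo $ROOT_CID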

Step 7: Share the root CID

Share the CID of your new snapshot so people can access it and replicate it onto their machines.

Step 8: Update *.wikipedia-on-ipfs.org

Make sure at least two full reliable copies exist before updating DNSLink.
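
Updating a DNSLink means changing the TXT record under the _dnslink subdomain so that it points at the new root CID. A sketch of the record format only (the actual DNS change is made by whoever controls the wikipedia-on-ipfs.org zone):

_dnslink.tr.wikipedia-on-ipfs.org.  TXT  "dnslink=/ipfs/<new root CID>"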

mirrorzim.sh

It is possible to automate steps 3-6 via a wrapper script named mirrorzim.sh. It will download the latest snapshot of the specified language (if needed), unpack it, and add it to IPFS.

To see how the script behaves try running it on one of the smallest wikis, such as cu:

$ ./mirrorzim.sh --languagecode=cu --wikitype=wikipedia --hostingdnsdomain=cu.wikipedia-on-ipfs.org

Docker build

A Dockerfile with all the software requirements is provided. For now it is only a handy container for running the process on non-Linux systems, or if you don't want to pollute your system with all the dependencies. In the future it will be an end-to-end black box that takes a ZIM and spits out a CID and repo.

To build the docker image:

docker build . -t distributed-wikipedia-mirror-build

To use it as a development environment:

docker run -it -v $(pwd):/root/distributed-wikipedia-mirror --net=host --entrypoint bash distributed-wikipedia-mirror-build

How to Help

If you don't mind the command line and have a lot of disk space, bandwidth, or coding skills, continue reading.

Share mirror CID with people who can't trust DNS

Sharing a CID instead of a DNS name is useful when DNS is not reliable or trustworthy. The latest CID for a specific language mirror can be found via DNSLink:

$ ipfs resolve -r /ipns/tr.wikipedia-on-ipfs.org
/ipfs/bafy..

The CID can then be opened as ipfs://bafy.. in a web browser with the IPFS Companion extension, which resolves IPFS addresses via an IPFS Desktop node.

You can also try Brave browser, which ships with native support for IPFS.
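
As a sketch, someone who receives only the raw CID (shown below as a placeholder <CID>) can inspect and replicate the snapshot without resolving any DNS names:

$ ipfs ls /ipfs/<CID> | head           # inspect the snapshot root
$ ipfs pin add --progress /ipfs/<CID>  # replicate the full snapshot locally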

Cohost a lazy copy

Using MFS makes it easier to protect snapshots from being garbage-collected than low-level pinning, because you can assign meaningful names and it won't prefetch any blocks unless you explicitly ask for them.

Every mirrored Wikipedia article you visit will be added to your lazy copy and will contribute to your partial mirror, so you won't need to host the entire thing.

To cohost a lazy copy, execute:

$ export LNG="tr"
$ ipfs files mkdir -p /wikipedia-mirror/$LNG
$ ipfs files cp $(ipfs resolve -r /ipns/$LNG.wikipedia-on-ipfs.org) /wikipedia-mirror/$LNG/${LNG}_$(date +%F_%T)

Then simply start browsing the $LNG.wikipedia-on-ipfs.org site via your node. Every visited page will be cached, cohosted, and protected from garbage collection.
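
To see how much of the snapshot you are actually cohosting so far (a small sketch; --with-local reports how much of the DAG is present in your local repo):

$ ipfs files ls /wikipedia-mirror/$LNG
$ ipfs files stat --with-local /wikipedia-mirror/$LNG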

Cohost a full copy

The steps are the same as for a lazy copy, but you execute an additional preload after the lazy copy is in place:

$ # export LNG="tr"
$ ipfs refs -r /ipns/$LNG.wikipedia-on-ipfs.org

Before you execute this, check if you have enough disk space to fit CumulativeSize:

$ # export LNG="tr"
$ ipfs object stat --human /ipns/$LNG.wikipedia-on-ipfs.org
NumLinks:       5
BlockSize:      281
LinksSize:      251
DataSize:       30
CumulativeSize: 15 GB

We are working on improving deduplication between snapshots, but for now YMMV.
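
After the preload finishes, you can check how much disk space the copy actually occupies in your repo (a sketch):

$ ipfs repo stat --human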

Code

If you would like to contribute more to this effort, look at the issues in this github repo. Especially check for issues marked with the "wishlist" label and issues marked "help wanted".
