  • Language
    JavaScript
  • License
    MIT License
  • Created over 4 years ago
  • Updated almost 2 years ago

An exploration to host Wikipedia in IPFS

Wikipedia-IPFS

An exploration to host Wikipedia in IPFS. This project contains code to extract content from Wikipedia and add it to IPFS, along with documentation of the proposed architecture. It is only a proof of concept and is not ready for any serious use.

Introduction

IPFS is a protocol for building a distributed web. Wikipedia is currently hosted on its own servers. To decentralize the sum of all human knowledge, we need to host and maintain all such knowledge in a distributed network. There are many candidate protocols for such a distributed web; IPFS and DAT are two examples. None of them is widely used by ordinary internet users, but they are in more or less active development.

IPFS attempted to host the Turkish Wikipedia a few years back. That attempt is based on a static snapshot of Wikipedia pages, basically static HTML files. If somebody updates the hosted snapshot, users get that snapshot. But Wikipedia is very dynamic. Thousands of edits happen every day, and new articles are created all the time.

If you are not already familiar with the concepts of the distributed web and IPFS, please do some background reading on them to better understand this document.

Goals

  1. Every Wikipedia content revision as an object in the distributed web, content addressable: A revision is the basic unit of content in the Wikipedia world. Each revision, once created, is immutable; there is no way to change it. Each revision has content associated with it and some metadata, such as who created it and when.
  2. Every Wikipedia article having an object in the distributed web with pointers to its revisions: An article in Wikipedia is editable. The latest revision represents the current state of the article, but one can always access its old revisions at any point. It is desirable to have a human-readable name along with the IPFS hash ID for each article.
  3. Every wikipedia having an object in the distributed web with the addresses of its articles: A wikipedia is a collection of articles (but is not limited to them). So a wikipedia like the English Wikipedia is a kind of registry listing all its articles. (In case you are wondering why I mention each wikipedia when there is a single Wikipedia: you may not know this, but there are wikipedias in nearly 300 languages. English Wikipedia, Spanish Wikipedia, and Tamil Wikipedia are examples.)
  4. A Wikipedia reading web application that can live in the distributed web: To make the content in the distributed web usable or consumable, we need a Wikipedia reading and possibly editing interface. This application presents the content for human consumption.
  5. An editor that lives in the distributed web: This editor adds or edits content and publishes it to IPFS.
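
The first three goals can be sketched as a tiny in-memory data model. This is only an illustration under stated assumptions: `fakeCid`, `put`, and `store` are hypothetical helpers, the hash is a toy stand-in for a real CID, and the field names are not taken from this repository.

```javascript
// Toy content addressing. A real system would store nodes in IPLD and get real CIDs;
// fakeCid is a hypothetical stand-in (32-bit FNV-1a-style hash of the JSON text).
function fakeCid(node) {
  const text = JSON.stringify(node);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return "cid-" + hash.toString(16).padStart(8, "0");
}

const store = new Map(); // address -> node

// Store a node under the address derived from its content.
function put(node) {
  const cid = fakeCid(node);
  store.set(cid, node);
  return cid;
}

// Goal 1: revisions are immutable, content-addressed objects with metadata.
const rev1 = put({ content: "First draft.", user: "Alice", timestamp: "2020-01-01T00:00:00Z" });
const rev2 = put({ content: "Second draft.", user: "Bob", timestamp: "2020-01-02T00:00:00Z" });

// Goal 2: an article has a human-readable title and pointers to its revisions.
const articleBefore = put({ title: "Example", revisions: [rev1] });
const articleAfter = put({ title: "Example", revisions: [rev1, rev2] });

// Goal 3: a wikipedia is a registry of its articles.
const wiki = put({ name: "enwiki", articles: { Example: articleAfter } });

console.log({ rev1, rev2, articleBefore, articleAfter, wiki });
```

Because an address is derived from content, storing the same revision twice yields the same address, while adding a revision to an article produces a new article address and therefore a new wiki address.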

Architecture

In my previous attempt, I tried to model everything using files and the "files" feature of IPFS. You may read about that approach in README.old.md in this repo. After I published it, many people contacted me to discuss these concepts. From all those discussions, I found that it is better to model the content as Linked Data. It gives an easier path towards semantic knowledge (a concept I am very much interested in). So in this approach, I am using IPLD (InterPlanetary Linked Data).

The proposed architecture has four components: Feeder, Publisher, Editor, and Reader. Each of them is explained in detail in this document.

Feeder

For detailed documentation, see packages/feeder

This component adds content from the current Wikipedia to IPFS at massive scale. An implementation is available in this repository; see the packages/feeder folder.

Based on the articles that were edited recently (using the edit event stream), all available information about an article and its revisions is fetched from the Wikipedia APIs. This structured information is then transformed into an IPFS DAG. In other words, we represent the JSON-formatted API result as IPLD (InterPlanetary Linked Data).

This package also provides ways to programmatically create article nodes based on any other lists or categories. It is a bridge between the current Wikipedia and IPLD.

When an article is added to IPLD, the Feeder publishes a message on a PUBSUB topic. The message contains the CID of the article and its title. The Feeder does not add this article to any wikipedia in IPLD. Adding the article to a wikipedia and tracking it using the CID is done by the Publisher component explained below.
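
The Feeder flow described above can be sketched roughly as follows. This is not the code in packages/feeder: the input shape is a simplified, hypothetical stand-in for the real Wikipedia API response, the topic name `wikipedia/article` is an assumption, and `putNode` and `publish` are injected placeholders for whatever IPFS client is used. Treating the highest revision id as the latest revision is also an assumption of this sketch.

```javascript
// Hypothetical, simplified API result. The real Wikipedia API response is shaped differently.
const apiResult = {
  title: "Example",
  revisions: [
    { revid: 101, user: "Alice", timestamp: "2020-01-01T00:00:00Z", content: "First draft." },
    { revid: 102, user: "Bob", timestamp: "2020-01-02T00:00:00Z", content: "Second draft." },
  ],
};

// Store each revision, then an article node linking to them, then announce the article.
// putNode(node) must return an address; publish(topic, data) must send a message.
function feed(result, putNode, publish) {
  const revisions = {};
  let latest = -Infinity;
  for (const { revid, user, timestamp, content } of result.revisions) {
    revisions[String(revid)] = putNode({ user, timestamp, content });
    latest = Math.max(latest, revid); // assumption: highest revision id is the latest
  }
  const cid = putNode({ title: result.title, latest: String(latest), revisions });
  // Assumed topic name; the message carries the article CID and its title.
  publish("wikipedia/article", JSON.stringify({ cid, title: result.title }));
  return cid;
}

// Stubs for demonstration only: addresses are placeholders, not real CIDs.
const stored = [];
const messages = [];
const putNode = (node) => {
  stored.push(node);
  return "placeholder-cid-" + (stored.length - 1);
};
const publish = (topic, data) => messages.push({ topic, data });

const articleCid = feed(apiResult, putNode, publish);
console.log(articleCid, messages);
```

Injecting `putNode` and `publish` keeps the transformation testable without a running IPFS node; a real Feeder would pass functions backed by its IPFS client instead.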

The main reason for adding articles to IPLD independently is that publishing and tracking need their own "authority control" or keys. Secondly, we need a lot of such feeders, because there are too many edits happening across all 300+ wikis for a single feeder to listen to and add to IPLD. Since we are talking about a distributed Wikipedia, its components should be truly distributed too, right? There is no need to worry about duplication in IPFS anyway, since everything is content addressed.

A real article from the Malayalam Wikipedia has been published this way. You may explore this node using the IPLD explorer: https://explore.ipld.io/#/explore/bafyreifs7kodvs4qamc2e5fdgzqaganabn5t36pzqgijhiqa3t53az5tg4

A revision node can be inspected using this IPLD explorer link: https://explore.ipld.io/#/explore/zdpuAxU6hAxiU47Ga9HU6ok1vKztVagns5AxurAJMnVJEWJBQ/revisions/3306691

Publisher

For detailed documentation, see packages/publisher

This component keeps track of the articles in a wikipedia. The Publisher publishes a tracker to IPLD for every wikipedia. The tracker has article titles as keys and article CIDs as values. The entries in this tracker are collected by subscribing to the article create/edit messages published by the Feeder (and also by the Editor, as explained below). Once an article CID changes, the CID of that wikipedia also changes. Knowing the latest CID of a particular wikipedia is important for accessing the latest articles in that wikipedia.

For this, the Publisher publishes the CID and wiki name to IPFS PUBSUB.

But there are many wikis, and we need a tracker for tracking all of them. So the Publisher maintains a Wikipedia tracker, which contains each wikipedia's name and its latest CID.

Since CIDs keep changing with every edit, humans need a stable name to access the content. IPNS provides this. The publisher program tries to update the IPNS name to point to the latest CID. Currently this is not an accurate process, since updating IPNS is very slow. As IPFS improves IPNS performance, our program will become more accurate.

But to overcome the difficulties of slow IPNS, the publisher program broadcasts the latest CID on an IPFS PUBSUB topic, 'wikipedia/cid'.
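
The two trackers and the broadcast can be sketched as below. This is a minimal model, not the code in packages/publisher: the `Publisher` class, its method names, and the presence of a `wiki` field in the incoming message are assumptions. The `putNode` stand-in uses the serialized content itself as the "CID" so that equal content gives equal ids; a real CID is a hash of the content.

```javascript
// Stand-in for storing a node in IPLD: the "CID" is just the serialized content.
const putNode = (node) => "cid:" + JSON.stringify(node);

class Publisher {
  constructor(putNode, broadcast) {
    this.putNode = putNode;
    this.broadcast = broadcast;
    this.articleTrackers = new Map(); // wiki name -> { title: article CID }
    this.wikis = {}; // wiki name -> latest CID of that wiki's article tracker
  }

  // Handle one article create/edit message from a Feeder or an Editor.
  // Assumption: the message names the wiki in addition to the title and CID.
  onArticleMessage({ wiki, title, cid }) {
    const tracker = { ...(this.articleTrackers.get(wiki) || {}), [title]: cid };
    this.articleTrackers.set(wiki, tracker);
    // A changed article CID changes the wiki CID, which changes the root CID.
    this.wikis = { ...this.wikis, [wiki]: this.putNode(tracker) };
    const rootCid = this.putNode(this.wikis);
    this.broadcast("wikipedia/cid", rootCid);
    return rootCid;
  }
}

const broadcasts = [];
const publisher = new Publisher(putNode, (topic, cid) => broadcasts.push({ topic, cid }));
const rootBefore = publisher.onArticleMessage({ wiki: "mlwiki", title: "Example", cid: "article-cid-1" });
const rootAfter = publisher.onArticleMessage({ wiki: "mlwiki", title: "Example", cid: "article-cid-2" });
console.log(rootBefore !== rootAfter); // an article edit changes the root CID
```

Editorial checks such as validation or spam detection would fit at the top of `onArticleMessage`, before the tracker is updated.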

How many such publishers are required? We can have any number of publishers, but only one publisher can hold the key to publish the IPNS name of the Wikipedia IPLD universe. If another publisher creates an IPNS name for a CID, that name will be different.

The publisher can also take on some more "editorial" roles, such as authenticating the article publish messages with a user's certificate or key (depending on how we design it). It can validate an article IPLD against a schema or validation rules. It can have spam detection, and so on. Theoretically, this opens up the possibility of multiple Wikipedias existing in IPLD with different editorial policies. This is an interesting outcome; I have not fully thought through the implications.

Editor

Wikipedia is editable. Editing an article in IPLD and publishing a new revision is possible. This is similar to what we did in the Feeder. The editor can be anything, as long as it creates a new valid article IPLD. The new CID of this article is then published in IPFS PUBSUB. Somewhere, a publisher will pick this up and decide whether to add it to the wikipedia tracker. Ideally, the editor will be part of a reader application.

Will that edit be reflected in the non-distributed Wikipedia? I don't know.

Reader

A reader application resolves the IPNS name of the Wikipedia IPLD to get the current CID, and/or subscribes to IPFS PUBSUB to get the latest CID. It then gets the content of an article by traversing to the wikis tracker and then to the article tracker. Finally, it gets the latest CID of the article and renders the article to the user.
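
The traversal can be sketched like this. It is a minimal illustration over a hypothetical in-memory object standing in for IPLD; the ids, field names (`latest`, `revisions`), and the `get` and `readArticle` helpers are assumptions, not part of this repository.

```javascript
// Hypothetical in-memory stand-in for IPLD: placeholder id -> node.
const nodes = {
  "root-1": { mlwiki: "wiki-1" }, // wikis tracker: wiki name -> wiki CID
  "wiki-1": { Example: "article-1" }, // article tracker: title -> article CID
  "article-1": {
    title: "Example",
    latest: "102",
    revisions: { 101: "rev-101", 102: "rev-102" },
  },
  "rev-101": { content: "First draft." },
  "rev-102": { content: "Second draft." },
};

// Fetch a node by id, failing loudly when a link cannot be followed.
function get(cid) {
  if (!Object.prototype.hasOwnProperty.call(nodes, cid)) {
    throw new Error("Node not found: " + cid);
  }
  return nodes[cid];
}

// Root CID -> wikis tracker -> article tracker -> article -> latest revision.
function readArticle(rootCid, wiki, title) {
  const wikisTracker = get(rootCid);
  const articleTracker = get(wikisTracker[wiki]);
  const article = get(articleTracker[title]);
  const revision = get(article.revisions[article.latest]);
  return { title: article.title, content: revision.content };
}

console.log(readArticle("root-1", "mlwiki", "Example"));
```

In a real reader, `rootCid` would come from resolving the IPNS name or from the latest PUBSUB message, and `get` would fetch nodes from IPFS.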

This application should also be hosted in IPFS or be available locally on users' devices.

In the past, I (Santhosh) attempted to build a static web application that can be hosted in the distributed web. I have placed this application in IPFS. See https://bafzbeigwtdcnrxx34bkdfxvcw2vtwibwij3vcrqthahomcajxkcm6ddlka.ipns.dweb.link/

Alternatively, this application can be run from a desktop or mobile device (it is a Progressive Web App). Some work is still required on this front, but there is a proof of concept. It currently uses the Wikipedia REST API and needs to be rewired to take content from the distributed web.

The in-browser js-ipfs APIs are in active development and, from my testing, are not ready for this use case. IPNS resolving is not possible with the latest version.

Permanent address

If every edit changes the CID or hash of a wiki, how do we refer to it in a permanent way? IPFS provides a way to do this: IPNS (InterPlanetary Name System).

"Inter-Planetary Name System (IPNS) is a system for creating and updating mutable links to IPFS content. Since objects in IPFS are content-addressed, their address changes every time their content does. That's useful for a variety of things, but it makes it hard to get the latest version of something. A name in IPNS is the hash of a public key. It is associated with a record containing information about the hash it links to that is signed by the corresponding private key. New records can be signed and published at any time."

So every wikipedia will have, in addition to its ipfs/CID address, an IPNS address like /ipns/QwxoosidSOKWms... If that is not readable enough, DNSLink comes in handy, and we can have addresses like /ipns/en.wikipedia.org.
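
The idea of a stable name pointing at a changing CID can be modelled with a toy name table. This is only a conceptual sketch: `NameTable` is hypothetical, the name and CIDs are placeholders, and the signing and verification that real IPNS records require are omitted entirely.

```javascript
// Toy name table: stable name -> { value, sequence }.
// Real IPNS records are signed with the private key behind the name; that is omitted here.
class NameTable {
  constructor() {
    this.records = new Map();
  }

  // Point the name at a new CID and bump the record's sequence number.
  publish(name, cid) {
    const previous = this.records.get(name);
    const sequence = previous ? previous.sequence + 1 : 0;
    this.records.set(name, { value: "/ipfs/" + cid, sequence });
    return sequence;
  }

  // Return the path the name currently points to.
  resolve(name) {
    const record = this.records.get(name);
    if (!record) {
      throw new Error("No record for " + name);
    }
    return record.value;
  }
}

const names = new NameTable();
names.publish("example-wiki-name", "placeholder-cid-1");
names.publish("example-wiki-name", "placeholder-cid-2"); // after an edit
console.log("/ipns/example-wiki-name ->", names.resolve("example-wiki-name"));
```

Readers only ever need the stable name; each edit changes what it resolves to, not the name itself.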

Search

The IPLD-based representation of knowledge is usable only if people can easily search the content. Search is not just about keywords, but also about semantic querying, as we do with SPARQL. In this exploration, the data in IPLD is not strictly based on RDF, but it can be. If we can represent the data in RDF, can we have a SPARQL implementation for IPLD?

Beyond wikipedia

While working on this exploration and studying IPLD, multihash, and multiformats, I started thinking about making all non-Wikipedia knowledge structures part of this IPLD as well. IPLD allows linking to independent IPLDs existing in IPFS; they can be in any IPLD-compatible format, such as Git. Also, if there are educational and knowledge resources existing in IPLD, it is quite trivial to link them to an article IPLD. I talked about Wikipedia and articles in this exploration, but the Wikidata information associated with each article can also be easily linked to the article IPLD.

Disclaimer

Even though the author is an engineer at the Wikimedia Foundation, this is not an official Wikimedia Foundation project.
