Repository Details

Tools to convert Wikipedia dumps into Git repositories.

PLEASE NOTE: THIS PROJECT IS NO LONGER MAINTAINED

The current state of the Levitation project

Levitation is a software project which, as a preparation for a decentralized Wikipedia, converts MediaWiki XML dump files into Git repositories, creating a Git commit for each wiki edit.

It has been abandoned by its original author, Tim Weber (aka Scytale or scy, the same person who wrote this overview page you’re currently reading), in 2009. The reason for that was that Tim lacked the time to work on it any further, and no significant contributions from other people were made.

However, other people have since worked on the codebase, as the original Levitation is free software and anyone is allowed to modify it:

If you have questions about Levitation, want to work on it or need support, I’d prefer you ask one of the two other guys first. You can also try contacting me (see http://scy.name/contact), but I might refuse to help (or even reply to) you if it takes too much time or effort. I am no longer maintaining Levitation, I’m sorry.

This note was originally written in March 2015 and updated in January 2017.

PLEASE NOTE: THIS PROJECT IS NO LONGER MAINTAINED

What follows is the last version of the README, updated to the project's abandoned state.

This is Levitation, a project to convert Wikipedia database dumps into Git
repositories. It has been successfully tested with a small Wiki
(bar.wikipedia.org) having 12,200 articles and 104,000 revisions. Importing
those took 6 minutes on a Core 2 Duo 1.66 GHz. RAM usage is minimal: pages
are imported one after the other, so it will at most require the amount of
memory needed to keep all revisions of a single page in memory. You should be
safe with 1 GB of RAM.
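
To illustrate why memory use stays bounded by a single page, here is a minimal
sketch of page-by-page streaming. It is not taken from import.py: it assumes
Python 3 and the standard library's xml.etree.ElementTree.iterparse, and the
names iter_pages, local_name and SAMPLE_DUMP are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a pages-meta-history.xml dump (assumed structure).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>A</title><id>1</id>
    <revision><id>10</id><text>one</text></revision>
    <revision><id>11</id><text>two</text></revision>
  </page>
  <page><title>B</title><id>2</id>
    <revision><id>12</id><text>three</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revisions) one page at a time.

    Only the revisions of the page currently being processed are held in
    memory; the element is cleared once the caller asks for the next page.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revisions = []
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                # Flatten the direct children of <revision> into a dict.
                revisions.append({local_name(f.tag): f.text for f in child})
        yield title, revisions
        elem.clear()  # drop this page's revisions before reading the next one


if __name__ == "__main__":
    for title, revisions in iter_pages(io.BytesIO(SAMPLE_DUMP)):
        print(title, len(revisions))
```

For a real dump you would pass sys.stdin.buffer (or an open file) instead of
the in-memory sample.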

See below (“Things that work”) for the status at the time it was abandoned.

Some knowledge of Git is required to use this tool. And you will probably
need to edit some variables in the source code.

You need at least Python 2.5; my tests run with 2.6.


How it should be done:

You can get recent dumps of all Wikimedia wikis at:
http://download.wikimedia.org/backup-index.html

The pages-meta-history.xml file is what we want. (In case you’re wondering:
Wikimedia does not offer content SQL dumps anymore, and there are no
full-history dumps for en.wikipedia.org because of its size.) It includes all
pages in all namespaces and all of their revisions.

Alternatively, you may use a MediaWiki’s “Special:Export” page to create an XML
dump of certain pages.


Things that work:

 - Read a Wikipedia XML full-history dump and output it in a format suitable
   for piping into git-fast-import(1). The resulting repository contains one
   file per page. All revisions are available in the history. There are some
   restrictions, read below.

 - Use the original modification summary as commit message.

 - Read the Wiki URL from the XML file and set user mail addresses accordingly.

 - Use the author name in the commit instead of the user ID.

 - Store additional information in the commit message that specifies page and
   revision ID as well as whether the edit was marked as “minor”.

 - Use the page’s name as file name instead of the page ID. Non-ASCII
   characters and some ASCII ones will be replaced by “.XX”, where .XX is their
   hex value.

 - Put pages in namespace-based subdirectories.

 - Put pages in a configurably deep subdirectory hierarchy.

 - Use command line options instead of hard-coded magic behavior. Thanks to
   stettberger for adding this.

 - Use a locally timezoned timestamp for the commit date instead of a UTC one.
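
The file naming items in the list above (“.XX” escaping, namespace
subdirectories, configurable hierarchy depth) could be sketched roughly as
follows. This is an illustration under assumptions, not the code from
import.py: the set of characters kept as-is, the per-UTF-8-byte escaping of
non-ASCII characters, and the rule of using the first escaped characters as
directory levels are all guesses, and the names SAFE_CHARS, escape_tokens,
escape_title and page_path are hypothetical. It is written for Python 3.

```python
import string

# Assumption: only these ASCII characters are kept unescaped. The "." itself
# is escaped too, so that the ".XX" encoding stays unambiguous.
SAFE_CHARS = set(string.ascii_letters + string.digits + "-_")


def escape_tokens(title):
    """Split a title into tokens: either one safe character or a '.XX' escape."""
    tokens = []
    for byte in title.encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE_CHARS:
            tokens.append(ch)
        else:
            tokens.append(".%02X" % byte)  # hex value of the byte
    return tokens


def escape_title(title):
    """Return the escaped file name for a page title."""
    return "".join(escape_tokens(title))


def page_path(namespace, title, depth=2):
    """Build 'namespace/level1/level2/.../filename' with `depth` levels.

    Directory levels are whole tokens, so an escape like '.C3' is never cut
    in half (which could otherwise produce a directory named '.').
    """
    tokens = escape_tokens(title)
    parts = [namespace] if namespace else []
    parts.extend(tokens[:depth])
    parts.append("".join(tokens))
    return "/".join(parts)


if __name__ == "__main__":
    print(page_path("Talk", "Foo"))
    print(page_path("", "a.b", depth=1))
```

Spreading files over a few directory levels keeps individual directories (and
Git tree objects) small, which fits the size observation under “Things that
are strange”.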


Things that are still missing:

 - Allow IPv6 addresses as IP edit usernames. (Although afaics MediaWiki itself
   cannot handle IPv6 addresses, so we got some time.)


Things that are strange:

 - Since we use subdirectories, the Git repo is no longer larger than the
   uncompressed XML file, but instead about 30% of it. This is good. However,
   it is still way larger than the bz2 compressed file, and I don’t know why.


Things that are cool:

 - “git checkout master~30000” takes you back 30,000 edits in time — and on my
   test machine it only took about a second.

 - The XML data might be in the wrong order to directly create commits from it,
   but it is in the right order for blob delta compression: When passing blobs
   to git-fast-import, delta compression will be tried based on the previous
   blob — which is the same page, one revision before. Therefore, delta
   compression will succeed and save you tons of storage.
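
The ordering issue described above can be sketched like this: blobs are
written in dump order (page by page), and the commits are written afterwards
in revision order. This is a simplified Python 3 illustration, not the code
from import.py. The function names, the use of the revision ID as blob mark,
sorting commits by revision ID, the fixed +0000 offset and the “wiki.invalid”
mail domain are all assumptions made for the example; only the “blob”, “mark”,
“data”, “commit”, “committer” and “M” commands are actual git-fast-import
syntax.

```python
import io


def blob_command(mark, content):
    """Return a fast-import 'blob' command for raw bytes `content`."""
    header = "blob\nmark :%d\ndata %d\n" % (mark, len(content))
    return header.encode("ascii") + content + b"\n"


def commit_command(blob_mark, path, author, when, message,
                   domain="wiki.invalid"):
    """Return a fast-import 'commit' command that sets `path` to a blob.

    `author` must not contain '<', '>' or newlines. No 'from' line is
    written: within one import, fast-import continues from the branch tip.
    """
    msg = message.encode("utf-8")
    head = "commit refs/heads/master\ncommitter %s <%s@%s> %d +0000\ndata %d\n" % (
        author, author, domain, when, len(msg))
    change = "M 100644 :%d %s\n" % (blob_mark, path)
    return head.encode("utf-8") + msg + b"\n" + change.encode("utf-8")


def build_stream(pages):
    """pages: list of (path, [(rev_id, unix_time, author, comment, text)])."""
    out = io.BytesIO()
    meta = []
    # Pass 1: blobs in dump order, so each blob follows the previous revision
    # of the same page.
    for path, revisions in pages:
        for rev_id, when, author, comment, text in revisions:
            out.write(blob_command(rev_id, text.encode("utf-8")))
            meta.append((rev_id, when, author, comment, path))
    # Pass 2: commits in revision order (assumed to match edit order).
    for rev_id, when, author, comment, path in sorted(meta):
        out.write(commit_command(rev_id, path, author, when, comment))
    return out.getvalue()


if __name__ == "__main__":
    import sys
    sample = [("A", [(1, 100, "alice", "first", "one"),
                     (3, 300, "alice", "third", "three")]),
              ("B", [(2, 200, "bob", "second", "two")])]
    sys.stdout.buffer.write(build_stream(sample))
```

The output of such a script is meant to be piped into “git fast-import”, as
in the example usage below.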


Example usage:

Please note that there’s the variable IMPORT_MAX, right at the beginning of
import.py. By default it’s set to 100, so Levitation will only import 100
pages, not more. This protects you from filling your disk when you’re too
impatient. ;) Set it to -1 when you’re ready for a “real” run.

This will import the pdc.wikipedia.org dump into a new Git repository “repo”:

  rm -rf repo; git init --bare repo && \
    ./import.py < ~/pdcwiki-20091103-pages-meta-history.xml | \
    GIT_DIR=repo git fast-import | \
    sed 's/^progress //'

Execute “import.py --help” to see all available options.


Storage requirements:

Let “maxrev” be the highest revision ID in the file.

Let “maxpage” be the highest page ID in the file.

Let “maxuser” be the highest user ID in the file.

The revision metadata storage needs maxrev*17 bytes.

The revision comment storage needs maxrev*257 bytes.

The author name storage needs maxuser*257 bytes.

The page title storage needs maxpage*257 bytes.

Those files can be deleted after an import.
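
As a worked example of the formulas above, the following small Python 3 helper
adds them up. The function name and the sample numbers are made up for
illustration; only the per-ID byte counts (17 and 257) come from the text
above. The sizes scale with the highest ID rather than with the number of
entries, presumably because the files are indexed by ID.

```python
def metadata_storage_bytes(maxrev, maxpage, maxuser):
    """Return the size in bytes of each temporary metadata file."""
    return {
        "revision metadata": maxrev * 17,
        "revision comments": maxrev * 257,
        "author names": maxuser * 257,
        "page titles": maxpage * 257,
    }


if __name__ == "__main__":
    # Hypothetical dump: highest revision ID 1000, page ID 100, user ID 10.
    sizes = metadata_storage_bytes(maxrev=1000, maxpage=100, maxuser=10)
    for name, size in sizes.items():
        print("%-18s %8d bytes" % (name, size))
    print("%-18s %8d bytes" % ("total", sum(sizes.values())))
```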

Additionally, the content itself needs some space. My test repo was about 15%
the size of the uncompressed XML, that is about 300% the size of the bz2
compressed XML data (see “Things that are strange”).

Note that if you want to check out a working copy, the filesystem it will be
living on needs quite a few free inodes. If you get “no space left on device”
errors with plenty of space available, that’s what hit you.


Contacting the author:

This monster was written in 2009 by Tim “Scytale” Weber (today aka “scy”). It
was an experiment to see whether the “relevance war” in the German Wikipedia at
that time could be ended by decentralizing content. It is no longer actively
maintained by me.


This whole bunch of tasty bytes is licensed under the terms of the WTFPLv2.
