• Stars
    star
    229
  • Rank 168,931 (Top 4 %)
  • Language
    JavaScript
  • License
    Other
  • Created over 9 years ago
  • Updated 9 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

roll a wikipedia dump into mongo

dumpster-dive

wikipedia dump parser
by Spencer Kelly, Devrim Yasar, and others

gets a wikipedia xml dump into mongo,
so you can mess-around.

πŸ’‚ Yup πŸ’‚

do it on your laptop.

dumpster-dive is a node script that puts a highly-queryable wikipedia on your computer in a nice afternoon.

It uses worker-nodes to process pages in parallel, and wtf_wikipedia to turn wikiscript into any json.

-- en-wikipedia takes about 5-hours, end-to-end --

dumpster

this library writes to a database,

if you'd like to simply write files to the filesystem, use dumpster-dip instead.

npm install -g dumpster-dive

😎 API

var dumpster = require('dumpster-dive');
dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, callback);

Command-Line:

dumpster /path/to/my-wikipedia-article-dump.xml --citations=false --images=false

then check em out in mongo:

$ mongo        #enter the mongo shell
use enwiki     #grab the database
db.pages.count()
# 4,926,056...
db.pages.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
#  "Populated places established in 1793" ...]

Steps:

1️⃣ you can do this.

you can do this. just a few Gb. you can do this.

2️⃣ get ready

Install nodejs (at least v6), mongodb (at least v3)

# install this script
npm install -g dumpster-dive # (that gives you the global command `dumpster`)
# start mongo up
mongod --config /mypath/to/mongod.conf

3️⃣ download a wikipedia

The Afrikaans wikipedia (around 93,000 articles) only takes a few minutes to download, and 5 mins to load into mongo on a macbook:

# download an xml dump (38mb, couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2

the english dump is 16Gb. The download page is confusing, but you'll want this file:

wget https://dumps.wikimedia.org/${LANG}wiki/latest/${LANG}wiki-latest-pages-articles.xml.bz2

for example, the English version is:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

4️⃣ unzip it

i know, this sucks. but it makes the parser so much faster.

bzip2 -d ./afwiki-latest-pages-articles.xml.bz2

On a macbook, unzipping en-wikipedia takes an hour or so. This is the most-boring part. Eat some lunch.

The english wikipedia is around 60Gb.

5️⃣ OK, start it off

#load it into mongo (10-15 minutes)
dumpster ./afwiki-latest-pages-articles.xml

6️⃣ take a bath

just put some epsom salts in there, it feels great.

The en-wiki dump should take a few hours. Maybe 8. Maybe 4. Have a snack prepared.

The console will update you every couple seconds to let you know where it's at.

7️⃣ done!

image

hey, go check-out your data - hit-up the mongo console:

$ mongo
use afwiki //your db name

//show a random page
db.pages.find().skip(200).limit(2)

//find a specific page
db.pages.findOne({title:"Toronto"}).categories

//find the last page
db.pages.find().sort({$natural:-1}).limit(1)

// all the governors of Kentucky
db.pages.count({ categories : { $eq : "Governors of Kentucky" }}

//pages without images
db.pages.count({ images: {$size: 0} })

alternatively, you can run dumpster-report afwiki to see a quick spot-check of the records it has created across the database.

Same for the English wikipedia:

the english wikipedia will work under the same process, but the download will take an afternoon, and the loading/parsing a couple hours. The en wikipedia dump is a 13 GB (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection uncompressed. It's something like 51GB, but mongo can do it πŸ’ͺ.

Options:

dumpster follows all the conventions of wtf_wikipedia, and you can pass-in any fields for it to include in it's json.

  • human-readable plaintext --plaintext
dumpster({ file: './myfile.xml.bz2', db: 'enwiki', plaintext: true, categories: false });
/*
[{
  _id:'Toronto',
  title:'Toronto',
  plaintext:'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
  • disambiguation pages / redirects --skip_disambig, --skip_redirects by default, dumpster skips entries in the dump that aren't full-on articles, you can
let obj = {
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki',
  skip_redirects: false,
  skip_disambig: false
};
dumpster(obj, () => console.log('done!'));
  • reducing file-size: you can tell wtf_wikipedia what you want it to parse, and which data you don't need:
dumpster ./my-wiki-dump.xml --infoboxes=false --citations=false --categories=false --links=false
  • custom json formatting you can grab whatever data you want, by passing-in a custom function. It takes a wtf_wikipedia Doc object, and you can return your cool data:
let obj = {
  file: path,
  db: dbName,
  custom: function (doc) {
    return {
      _id: doc.title(), //for duplicate-detection
      title: doc.title(), //for the logger..
      sections: doc.sections().map((i) => i.json({ encode: true })),
      categories: doc.categories() //whatever you want!
    };
  }
};
dumpster(obj, () => console.log('custom wikipedia!'));

if you're using any .json() methods, pass a {encode:true} in to avoid mongo complaints about key-names.

  • non-main namespaces: do you want to parse all the navboxes? change namespace in ./config.js to another number

  • remote db: if your databse is non-local, or requires authentication, set it like this:

dumpster({ db_url: 'mongodb://username:password@localhost:27017/' }, () => console.log('done!'));

how it works:

this library uses:

Addendum:

_ids

since wikimedia makes all pages have globally unique titles, we also use them for the mongo _id fields. The benefit is that if it crashes half-way through, or if you want to run it again, running this script repeatedly will not multiply your data. We do a 'upsert' on the record.

encoding special characters

mongo has some opinions on special-characters in some of its data. It is weird, but we're using this standard(ish) form of encoding them:

\  -->  \\
$  -->  \u0024
.  -->  \u002e

Non-wikipedias

This library should also work on other wikis with standard xml dumps from MediaWiki (except wikidata!). I haven't tested them, but the wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz-compressed xml dump from your wiki, this should work fine. Open an issue if you find something weird.

did it break?

if the script trips at a specific spot, it's helpful to know the article it breaks on, by setting verbose:true:

dumpster({
  file: '/path/to/file.xml',
  verbose: true
});

this prints out every page's title while processing it..

16mb limit?

To go faster, this library writes a ton of articles at a time (default 800). Mongo has a 16mb limit on writes, so if you're adding a bunch of data, like latex, or html, it may make sense to turn this down.

dumpster --batch_size=100

that should do the trick.

PRs welcome!

This is an important project, come help us out.

More Repositories

1

compromise

modest natural-language processing
JavaScript
11,140
star
2

spacetime

A lightweight javascript timezone library
JavaScript
3,822
star
3

wtf_wikipedia

a pretty-committed wikipedia markup parser
JavaScript
739
star
4

unrequired

find unused javascript files in your project
JavaScript
109
star
5

Freebase.js

inference and inspection on freebase data
JavaScript
107
star
6

efrt

neato compression for key-value data
JavaScript
90
star
7

famousd3

get famo.us to render d3js components
JavaScript
27
star
8

somehow-graph

Svelte infographics component
JavaScript
27
star
9

clooney

a graphing library in the famo.us engine
JavaScript
20
star
10

timezone-soft

parse informal timezone names
JavaScript
20
star
11

out-of-character

remove invisible unicode characters
JavaScript
17
star
12

thensome

i guess we'll find out.
JavaScript
15
star
13

spacetime-geo

determine date/time using geo-location
JavaScript
11
star
14

somehow

a number of Svelte infographics
9
star
15

fit-aspect-ratio

like math? me neither!
JavaScript
8
star
16

web-pure-data-front-end

a gui for the pure-data language written for the web
JavaScript
8
star
17

slow

whoa easy there javascript
JavaScript
6
star
18

compromise-highlight

syntax-highlighting for natural language text
JavaScript
6
star
19

wikidata-freebase

helping out in the wikidata migration
JavaScript
6
star
20

spacetime-week

you thought weeks were simple. you weren't right.
JavaScript
5
star
21

sunday-driver

be cool with large files
JavaScript
5
star
22

spacetime-ticks

calculate some sensible break-points between two dates
JavaScript
5
star
23

front_yard

where is the semantic web, if it's not out in front of your own house.
5
star
24

table-turn

html-table parser on the command line
JavaScript
4
star
25

simple_english

simplify natural language english in javascript
JavaScript
4
star
26

somehow-maps

make a map without thinking
JavaScript
4
star
27

compromise-align

generate html aligned by specific text matches
JavaScript
4
star
28

dumpster-dip

parse a wikipedia dump into tiny files
JavaScript
3
star
29

suffix-thumb

find the optimal transformations between words
JavaScript
3
star
30

somehow-calendar

calendar visualization
JavaScript
3
star
31

spacetime-daylight

calculate sunlight exposure for a given timezone
JavaScript
3
star
32

git-slop

cleaner git commands
JavaScript
3
star
33

somehow-circle

an easy way to make radial infographics
JavaScript
3
star
34

townhouse

a new-tab page with browsing history
CoffeeScript
3
star
35

a_wall_map

make a large satelite image to print as a wall map
JavaScript
2
star
36

NLP-OSS-2020

Talk for EMNLP 2020
JavaScript
2
star
37

scratch

first toil, then grave
JavaScript
2
star
38

wtf-plugin-nsfw

WIP content classifier for wikipedia articles
JavaScript
2
star
39

amble

a watch script for cleaner development
JavaScript
2
star
40

wrestlejs

a gui for goofing around with json files
JavaScript
2
star
41

wiki-summary

generate configurable-length descriptions from wikipedia articles
JavaScript
2
star
42

Dirty.js

do questionable things to the js built-in methods
JavaScript
2
star
43

wtf-plugin-mlb

parse baseball game data from wikipedia
JavaScript
2
star
44

crop-aspect

crop an image by a nearby aspect-ratio
JavaScript
2
star
45

nhl_scrape

scrape nhl.com schedule data into json
JavaScript
2
star
46

wtf-plugin-nhl

parse NHL data from wikipedia
JavaScript
2
star
47

Mount-Heavy

an android application to see pictures of nearby people that are no longer living
1
star
48

bitbar

widgets for matryer/bitbar
1
star
49

Spencer-bookmarklets

bookmarklets by spencer
1
star
50

garbage-patch

not the smartest json-patch implementation
JavaScript
1
star
51

Spencer-is-also-ubiquituous

ubiquity commands by spencer
1
star
52

somehow-timeline

a svelte component for layout with time as y-axis
JavaScript
1
star
53

freebase_garden

reasonable workflows for getting wikipedia data into freebase
JavaScript
1
star
54

spencers-chrome-extensions

chrome extensions by spencer
JavaScript
1
star
55

osm_yeah

a workflow for getting openstreetmap data presented into a browser
HTML
1
star
56

spencer-css

some tachyons-inspired css classes
CSS
1
star
57

spencermountain.github.io

yes sir, i do
JavaScript
1
star
58

somehow-ticks

generate nice axis-markings between two arbitrary numbers
JavaScript
1
star
59

somehow-script

a natural-language data-entry format
JavaScript
1
star
60

scal

modern UNIX cal command
JavaScript
1
star
61

somehow-sankey

WIP svelte sankey diagram component
JavaScript
1
star