  • Stars: 131
  • Rank: 267,206 (top 6%)
  • Language: Nim
  • License: MIT License
  • Created: about 9 years ago
  • Updated: over 5 years ago


Repository Details

Extract a plain text corpus from MediaWiki XML dumps, such as Wikipedia.

wiki2text (unmaintained)

I don't work on this project anymore, and the Nim language has probably moved far beyond the version I used to make this. You may be able to make it work for you, but no guarantees.

I now work on a full-blown LL parser for wikitext, wikiparsec.

Original introduction

What you put in: a .xml.bz2 file downloaded from Wikimedia

What you get out: gigabytes of clean natural language text

wiki2text is a fast pipeline that takes a MediaWiki XML dump -- such as the exports of Wikipedia that you can download from dumps.wikimedia.org -- and extracts just the natural-language text from it, skipping the Wiki formatting characters and the HTML tags.

This is particularly useful as a way to get free corpora, in many languages, for natural language processing.

The only formatting you will get is that the titles of new articles and new sections appear on lines that start and end with some number of = signs. I've found this useful for distinguishing titles from body text. If you don't need it, these lines are easy to exclude using grep -v.

wiki2text is written with these goals in mind:

  • Clean code that is clear about what it's doing
  • Doing no more than is necessary
  • Being incredibly fast (it parses an entire Wikipedia in minutes)
  • Being usable as a step in a pipeline

Thanks to def-, a core Nim developer, for making optimizations that make the code so incredibly fast.

Why Nim?

Why is this code written in a fast-moving, emerging programming language? It's an adaptation of a Python script that took days to run. Nim allowed me to keep the understandability of Python but also have the speed of C.

Setup

wiki2text needs to be compiled using Nim 0.11. Install Nim by following the directions on its download page.

You can build wiki2text from this repository by running:

make

Alternatively, you can install it using Nimble, Nim's package manager:

nimble install wiki2text

Usage

Download one of the database dumps from dumps.wikimedia.org. The file you want is the one whose name has the form *-pages-articles.xml.bz2. These files can be many gigabytes in size, so you might want to start with a language other than English that has a smaller number of articles.
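For example, to start with the much smaller Simple English Wikipedia (this assumes the usual layout of dumps.wikimedia.org, where each wiki has a "latest" directory; dated dump directories work the same way):

wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2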

But suppose you did download enwiki-DATE-pages-articles.xml.bz2. Then you should run:

bunzip2 -c enwiki-DATE-pages-articles.xml.bz2 | ./wiki2text > enwiki.txt

To skip all headings, run:

bunzip2 -c enwiki-DATE-pages-articles.xml.bz2 | ./wiki2text | grep -v '^=' > enwiki.txt

enwiki.txt will fill up with article text as quickly as it comes out of bunzip2.
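Once the plain text exists, it drops straight into ordinary Unix text tooling. As a rough sketch of the kind of corpus work this enables, here is a crude word-frequency count over the output (the tokenization just splits on non-letter characters, so treat it as a starting point, not a proper tokenizer):

tr -cs 'A-Za-z' '\n' < enwiki.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > wordcounts.txt

Each line of wordcounts.txt will contain a count followed by a word, most frequent first.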

Example output

Here's an example of part of the text that comes out of the English Wikipedia (with hard line wrapping added):

= Albedo =

Albedo (), or reflection coefficient, derived from Latin albedo "whiteness"
(or reflected sunlight) in turn from albus "white", is the diffuse
reflectivity or reflecting power of a surface. It is the ratio of reflected
radiation from the surface to incident radiation upon it. Its dimensionless
nature lets it be expressed as a percentage and is measured on a scale from
zero for no reflection of a perfectly black surface to 1 for perfect
reflection of a white surface.

Albedo depends on the frequency of the radiation. When quoted unqualified,
it usually refers to some appropriate average across the spectrum of
visible light. In general, the albedo depends on the directional
distribution of incident radiation, except for Lambertian surfaces, which
scatter radiation in all directions according to a cosine function and
therefore have an albedo that is independent of the incident distribution.
In practice, a bidirectional reflectance distribution function (BRDF) may
be required to accurately characterize the scattering properties of a
surface, but albedo is very useful as a first approximation.

The albedo is an important concept in climatology, astronomy, and
calculating reflectivity of surfaces in LEED sustainable-rating systems for
buildings. The average overall albedo of Earth, its planetary albedo, is 30
to 35% because of cloud cover, but widely varies locally across the surface
because of different geological and environmental features.

The term was introduced into optics by Johann Heinrich Lambert in his 1760
work Photometria.

==Terrestrial albedo==

Albedos of typical materials in visible light range from up to 0.9 for
fresh snow to about 0.04 for charcoal, one of the darkest substances.
Deeply shadowed cavities can achieve an effective albedo approaching the
zero of a black body. When seen from a distance, the ocean surface has a
low albedo, as do most forests, whereas desert areas have some of the
highest albedos among landforms. Most land areas are in an albedo range of
0.1 to 0.4. The average albedo of the Earth is about 0.3. This is far
higher than for the ocean primarily because of the contribution of clouds.
Earth's surface albedo is regularly estimated via Earth observation
satellite sensors such as NASA's MODIS instruments on board the Terra and
Aqua satellites. As the total amount of reflected radiation cannot be
directly measured by satellite, a mathematical model of the BRDF is used to
translate a sample set of satellite reflectance measurements into estimates
of directional-hemispherical reflectance and bi-hemispherical reflectance
(e.g.).

Earth's average surface temperature due to its albedo and the greenhouse
effect is currently about 15°C. If Earth were frozen entirely (and hence be
more reflective) the average temperature of the planet would drop below
−40°C. If only the continental land masses became covered by glaciers, the
mean temperature of the planet would drop to about 0°C. In contrast, if the
entire Earth is covered by water—a so-called aquaplanet—the average
temperature on the planet would rise to just under 27°C.

Limitations

You may notice that occasional words and phrases are missing from the text. These are the parts of the article that come from MediaWiki templates.

Templates are an incredibly complicated, Turing-complete subset of MediaWiki, and are used for everything from simple formatting to building large infoboxes, tables, and navigation boxes.

It would be nice if we could somehow keep only the simple ones and discard the complex ones, but what's easiest to do is to simply ignore every template.
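As a rough illustration of the effect (not this parser's literal output): a wikitext sentence such as

The average temperature is {{convert|15|C|F}} at the surface.

loses the templated measurement, coming out as something like

The average temperature is at the surface.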

Sometimes templates contain the beginnings or ends of HTML or Wikitable formatting that we would normally skip, in which case extra crud may show up in the article.

This probably doesn't work very well for wikis that have specific, meaningful formatting, such as Wiktionary. The conceptnet5 project includes a slow Wiktionary parser in Python that you might be able to use.

More Repositories

  1. python-ftfy (Python, 3,671 stars): Fixes mojibake and other glitches in Unicode text, after the fact.
  2. wordfreq (Python, 669 stars): Access a database of word frequencies, in various natural languages.
  3. langcodes (Python, 332 stars): A Python library for working with and comparing language codes.
  4. ordered-set (Python, 204 stars): A mutable set that remembers the order of its entries. One of Python's missing data types.
  5. dominiate (JavaScript, 118 stars): A simulator for Dominion card game strategies.
  6. text-as-data (Python, 50 stars): A PyData 2013 talk on straightforward, data-driven ways to handle natural language text in Python.
  7. wikiparsec (Haskell, 48 stars): An LL parser for extracting information from Wiki text, particularly Wiktionary.
  8. solvertools (JavaScript, 28 stars): Mystery Hunt solving tools for Metropolitan Rage Warehouse. Or anyone really.
  9. scholar.hasfailed.us (HTML, 20 stars): Google Scholar is a trans-exclusionary site. Don't use it. Help us demand change.
  10. dominiate-python (Python, 15 stars): A Python implementation of the card game Dominion.
  11. openmind-commons (JavaScript, 11 stars): The dynamic Web site that lets people browse and contribute to Open Mind Common Sense and ConceptNet.
  12. dominionstats (JavaScript, 11 stars): The code behind councilroom.com.
  13. csc-pysparse (C, 10 stars): A fast sparse matrix library for Python (Commonsense Computing version).
  14. music-decomp (Python, 10 stars): Associating music/sound and semantics.
  15. mixmaster (Python, 8 stars): Smarter than the average anagrammer.
  16. scorepile (JavaScript, 6 stars): A repository of Innovation games played on Isotropic.
  17. language_data (Python, 6 stars): An optional supplement to `langcodes` that stores names and statistics of languages.
  18. solvertools-2014 (Julia, 4 stars)
  19. adventure (Python, 4 stars): Common sense experiments for working with text adventures.
  20. charcol (Python, 4 stars): An experiment to collect unusual characters from Twitter.
  21. dominion-rank (Python, 3 stars): Calculate ranks based on people's play on dominion.isomorphic.org.
  22. countmerge (Rust, 3 stars): A command-line tool that adds counts for sorted keys.
  23. verb-aspect-learning (3 stars): A hierarchical Bayesian model of biases in how people learn novel verbs.
  24. svdview (Java, 3 stars): A Processing viewer for the results of dimensionality reduction.
  25. spacious_corpus (Python, 3 stars): A corpus build process for use with SpaCy projects.
  26. colorizer (JavaScript, 2 stars)
  27. irepad (JavaScript, 2 stars): An IRE PROOF collaborative editor, built on FirePad.
  28. rust-nlp-tools (Rust, 2 stars)
  29. rspeer-web (JavaScript, 2 stars): My personal Web site.
  30. analogy_farm (Python, 2 stars): A Web-based puzzle from MIT Mystery Hunt 2013.
  31. rspeer.github.io (TeX, 1 star): rspeer's Octopress site.