  • Language
    Python
  • Created over 11 years ago
  • Updated over 8 years ago

Repository Details

This is a mirror of the script by Giuseppe Attardi. It contains the history from before the official repo started: https://github.com/attardi/wikiextractor

The script extracts and cleans text from a Wikipedia database dump and stores the output in a number of files of similar size in a given directory.

Please refer to the official repo if there are any issues: https://github.com/attardi/wikiextractor


Wikipedia Extractor

Introduction

The project uses the Italian Wikipedia as a source of documents for several purposes: as training data and as a source of data to be annotated.

Each month the Wikipedia maintainers provide an XML dump of all documents in the database. It consists of a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics and service lists.

Wikipedia dumps are available from the Wikipedia database download page.

The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.

Each document in the dump of the encyclopedia is represented as a single XML element, encoded as illustrated in the following example from the document titled Armonium:

 <page>
 <title>Armonium</title>
 <id>2</id>
 <timestamp>2008-06-22T21:48:55Z</timestamp>
 <username>Nemo bis</username>
 <comment>italiano</comment>
 <text xml:space="preserve">thumb|right|300 px

 L'armonium' (in francese, harmonium) è uno
  strumento musicale azionato con una tastiera, detta
 manuale. Sono stati costruiti anche alcuni armonium con due manuali.

 ==Armonium occidentale==
 Come l'organo, l'armonium è utilizzato tipicamente in
 chiesa, per l'esecuzione di musica sacra, ed è
 fornito di pochi registri, quando addirittura in certi casi non ne possiede
 nemmeno uno: il suo timbro è molto meno ricco di quello
 organistico e così pure la sua estensione.

 ...

 ==Armonium indiano==
 Template:S sezione

 == Voci correlate ==
 *Musica
 *Generi musicali</text>
 </page>
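
Because the dump is a single very large XML file, it is usually read as a stream rather than loaded whole. The following is a minimal sketch of that idea using only the standard library. The names iter_pages and local_name are hypothetical helpers introduced for illustration; they are not functions of WikiExtractor.py, and the sample data is a shortened version of the example above wrapped in an assumed root element.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (id, title, text) for each <page> element in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local_name(child.tag)
            # Keep the first occurrence only, so that a later nested <id>
            # does not overwrite the page id.
            if name in ("title", "id", "text") and name not in fields:
                fields[name] = child.text or ""
        yield fields.get("id"), fields.get("title"), fields.get("text")
        elem.clear()  # release the memory held by the processed page


# Shortened sample; the <mediawiki> root element is an assumption.
SAMPLE = b"""<mediawiki>
<page>
<title>Armonium</title>
<id>2</id>
<text xml:space="preserve">==Armonium occidentale==
Come l'organo ...</text>
</page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages[0][:2])
```

Clearing each element after use keeps memory roughly constant, which matters when the input is the whole encyclopedia.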

For this document the Wikipedia extractor produces the following plain text:

<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con
una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con
due manuali.

Armonium occidentale.
Come l'organo, l'armonium è utilizzato tipicamente in chiesa, per l'esecuzione
di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello
organistico e così pure la sua estensione.
...
</doc>
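
Downstream tools often need to read this output back. The sketch below shows one way to do so, assuming that every document has exactly the shape shown above: an opening doc tag with id and url attributes, a first line holding the title followed by a period, and a closing doc tag. It uses a regular expression rather than an XML parser, since the extracted text is not guaranteed to be well-formed XML. The name iter_docs is hypothetical.

```python
import re

# Matches one <doc ...> ... </doc> block; DOTALL lets the body span lines.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)">\n(?P<body>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(text):
    """Yield a dict per <doc> block with id, url, title and text."""
    for match in DOC_RE.finditer(text):
        # The first line of the body holds the title followed by a period.
        title, _, rest = match.group("body").partition("\n")
        if title.endswith("."):
            title = title[:-1]
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": title,
            "text": rest.strip(),
        }


SAMPLE = '''<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
Sono stati costruiti anche alcuni armonium con due manuali.
</doc>
'''

docs = list(iter_docs(SAMPLE))
print(docs[0]["title"], docs[0]["url"])
```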

The extraction tool is written in Python and requires no additional libraries. It aims to achieve high accuracy in the extraction task.

Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underlines, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are often misused (unclosed tags, wrong attributes, etc.), so the extractor employs several heuristics to work around such problems. Template expansion is currently a missing feature of the extractor.
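
To give a flavour of what such cleaning involves, here is a deliberately small sketch based on regular expressions. It is an illustration only and does not reproduce the heuristics actually used by WikiExtractor.py. The function name strip_markup is hypothetical, and the rules cover just four cases: links, bold and italic quotes, HTML tags (whether or not they are closed), and section headings.

```python
import re


def strip_markup(text):
    """Remove a few common MediaWiki and HTML constructs from text."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> plain text (runs of two or more quotes)
    text = re.sub(r"'{2,}", "", text)
    # Drop any HTML tag, whether or not it has a matching closing tag.
    text = re.sub(r"</?[A-Za-z][^<>]*>", "", text)
    # == Heading == -> "Heading."
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1.", text, flags=re.MULTILINE)
    return text


raw = (
    "==Armonium occidentale==\n"
    "Come l'[[organo]], l'armonium ha una "
    "[[tastiera (musica)|tastiera]] <b>manuale"
)
print(strip_markup(raw))
```

A pure regular-expression approach like this breaks down on nested constructs such as templates and tables, which is one reason a real extractor needs more elaborate heuristics.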

Description

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the document format shown above.
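
One simple way to obtain files of similar size is to start a new file whenever the next document would push the current one past a byte limit. The sketch below shows that policy on in-memory strings. It is an assumption about how such splitting could be done, not a description of the script's internals, and split_by_size is a hypothetical name.

```python
def split_by_size(documents, max_bytes):
    """Group documents into chunks whose UTF-8 size stays near max_bytes."""
    chunk, size = [], 0
    for doc in documents:
        doc_size = len(doc.encode("utf-8"))
        # Start a new chunk when adding this document would exceed the
        # limit; a document is never split across two chunks.
        if chunk and size + doc_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(doc)
        size += doc_size
    if chunk:
        yield chunk


docs = ["a" * 300, "b" * 300, "c" * 300]
chunks = list(split_by_size(docs, 700))
print([len(chunk) for chunk in chunks])
```

With this policy a single document larger than the limit still ends up in a chunk of its own, so file sizes are similar rather than identical.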

Usage:

 WikiExtractor.py [options]

Options:

 -c, --compress        : compress output files using bzip
 -b, --bytes= n[KM]    : put specified bytes per output file (default 500K)
 -B, --base= URL       : base URL for the Wikipedia pages
 -o, --output= dir     : place output files in specified directory (default
                         current)
 -l, --link            : preserve links
 --help                : display this help and exit
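
The n[KM] argument of --bytes is a number with an optional K or M suffix. The sketch below shows how such a value could be turned into a byte count. It assumes that K means 1024 bytes and M means 1024 * 1024 bytes, which the help text above does not state, and parse_size is a hypothetical helper name.

```python
import re


def parse_size(spec):
    """Convert a size such as '500K' or '2M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KM]?)", spec.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"invalid size: {spec!r}")
    number = int(match.group(1))
    suffix = match.group(2).upper()
    # Assumption: binary multiples (K = 1024, M = 1024 * 1024).
    return number * {"": 1, "K": 1024, "M": 1024 * 1024}[suffix]


print(parse_size("500K"))
```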

Example of Use

The following commands illustrate how to apply the script to a Wikipedia dump:

> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> bzcat itwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted -

In order to combine the whole extracted text into a single file one can issue:

> find extracted -name '*bz2' -exec bunzip2 -c {} \; > text.xml
> rm -rf extracted
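
The same concatenation can be done from Python where find and bunzip2 are not available. This is a minimal sketch using only the standard library. The function name combine is hypothetical, and unlike the shell commands above it visits the files in sorted order and does not delete the input directory.

```python
import bz2
import shutil
from pathlib import Path


def combine(extracted_dir, target):
    """Decompress every file matching *bz2 under extracted_dir into target."""
    with open(target, "wb") as out:
        # rglob searches subdirectories too, like find; sorting makes the
        # order of the concatenated parts deterministic.
        for path in sorted(Path(extracted_dir).rglob("*bz2")):
            with bz2.open(path, "rb") as part:
                shutil.copyfileobj(part, out)


# Example call, mirroring the shell commands above:
# combine("extracted", "text.xml")
```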

Related Work
