  • Stars: 170
  • Rank: 222,843 (Top 5%)
  • Language: Ruby
  • License: MIT License
  • Created: over 12 years ago
  • Updated: over 1 year ago

Repository Details

A command-line toolkit to extract text content and category data from Wikipedia dump files

About

WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.

Changelog

May 2023

  • To avoid problems caused by spawning too many parallel processes, the number of worker processes is now capped at 8.

April 2023

  • File split/delete issues fixed

January 2023

  • Bug related to command line arguments fixed
  • Code cleanup introducing Rubocop

December 2022

  • Docker images available via Docker Hub

November 2022

  • Code added to suppress the "Invalid byte sequence" error when an illegal UTF-8 character is encountered in the input.

August 2022

  • A new option --category-only has been added. When this option is enabled, only the title and category information of each article are extracted.
  • A new option --summary-only has been added. When this option is enabled, only the title, category information, and opening paragraphs of each article are extracted.
  • Text conversion with the current version of WP2TXT is more than twice as fast as with the previous version, due to parallel processing of multiple files (the actual speedup depends on the number of CPU cores available).

Screenshot

Environment

  • WP2TXT 1.0.1
  • MacBook Pro (2021 Apple M1 Pro)
  • enwiki-20220720-pages-articles.xml.bz2 (19.98 GB)

In the above environment, the process (decompression, splitting, extraction, and conversion) to obtain the plain text data of the English Wikipedia takes less than 1.5 hours.

Features

  • Converts Wikipedia dump files in various languages
  • Creates output files of specified size
  • Allows specifying text elements (page titles, section headers, paragraphs, list items) to be extracted
  • Allows extracting category information of the article
  • Allows extracting opening paragraphs of the article

Setting Up

WP2TXT on Docker

  1. Install Docker Desktop (Mac/Windows/Linux)
  2. Execute the docker command in a terminal:
docker run -it -v /Users/me/localdata:/data yohasebe/wp2txt
  • Make sure to replace /Users/me/localdata with the full path to the data directory on your local computer
  3. The Docker image will begin downloading, and a bash prompt will appear when it has finished.
  4. The wp2txt command will be available anywhere in the Docker container. Use the /data directory as the location of the input dump files and the output text files.

IMPORTANT:

  • Configure Docker Desktop resource settings (number of cores, amount of memory, etc.) to get the best performance possible.
  • When running the wp2txt command inside a Docker container, be sure to set the output directory to somewhere in the mounted local directory specified by the docker run command.

WP2TXT on macOS and Linux

WP2TXT requires that one of the following commands be installed on the system in order to decompress bz2 files:

  • lbzip2 (recommended)
  • pbzip2
  • bzip2

In most cases, the bzip2 command is pre-installed on the system. However, since lbzip2 can use multiple CPU cores and is faster than bzip2, it is recommended that you install it additionally. WP2TXT will attempt to find the decompression command available on your system in the order listed above.
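The lookup order described above can be sketched as a small shell helper that returns the first decompressor found on the PATH (the function name first_available is illustrative, not part of WP2TXT):

```shell
# Return the first command from the argument list that exists on PATH.
# Mirrors WP2TXT's lookup order: lbzip2, then pbzip2, then bzip2.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

# Pick a decompressor the same way WP2TXT does.
decompressor=$(first_available lbzip2 pbzip2 bzip2)
```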

If you are using macOS with Homebrew installed, you can install lbzip2 with the following command:

$ brew install lbzip2

WP2TXT on Windows

Install Bzip2 for Windows and set the path so that WP2TXT can use the bunzip2.exe command. Alternatively, you can extract the Wikipedia dump file in your own way and process the resulting XML file with WP2TXT.

Installation

WP2TXT command

$ gem install wp2txt

Wikipedia Dump File

Download the latest Wikipedia dump file for the desired language at a URL such as

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Here, enwiki refers to the English Wikipedia. To get the Japanese Wikipedia dump file, for instance, change this to jawiki (Japanese). In doing so, note that there are two instances of enwiki in the URL above.
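Because the language code occurs twice in the URL, it is easy to miss one occurrence when editing by hand; a single global substitution rewrites both at once (a sketch, using the jawiki example):

```shell
# Rewrite every occurrence of "enwiki" in one step with sed's /g flag.
en_url="https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
ja_url=$(echo "$en_url" | sed 's/enwiki/jawiki/g')
echo "$ja_url"
# prints https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
```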

Alternatively, you can select Wikipedia dump files created on a specific date from the Wikimedia dumps site. Make sure to download a file named in the following format:

xxwiki-yyyymmdd-pages-articles.xml.bz2

where xx is a language code such as en (English) or ja (Japanese), and yyyymmdd is the date of creation (e.g. 20220801).
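The dated filename and its download URL can be composed from those two values (a sketch; the path layout https://dumps.wikimedia.org/xxwiki/yyyymmdd/ is the standard one for dated dumps):

```shell
lang="ja"            # xx: language code
dumpdate="20220801"  # yyyymmdd: dump creation date

# Build the dump filename and its download URL from the two variables.
file="${lang}wiki-${dumpdate}-pages-articles.xml.bz2"
url="https://dumps.wikimedia.org/${lang}wiki/${dumpdate}/${file}"
echo "$url"
```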

Basic Usage

Suppose you have a folder containing a Wikipedia dump file and empty subfolders organized as follows:

.
├── enwiki-20220801-pages-articles.xml.bz2
├── /xml
├── /text
├── /category
└── /summary
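The empty subfolders in this layout can be created in one step (the directory names are the ones used in the examples that follow):

```shell
# Create the working directory layout used in the examples below.
mkdir -p xml text category summary
```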

Decompress and Split

The following command will decompress the entire Wikipedia data set and split it into many small (approximately 10 MB) XML files.

$ wp2txt --no-convert -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./xml

Note: The resulting files are not well-formed XML. They contain parts of the original XML extracted from the Wikipedia dump file, taking care to ensure that the content within each tag is not split across multiple files.

Extract plain text from MediaWiki XML

$ wp2txt -i ./xml -o ./text

Extract only category info from MediaWiki XML

$ wp2txt -g -i ./xml -o ./category

Extract opening paragraphs from MediaWiki XML

$ wp2txt -s -i ./xml -o ./summary

Extract directly from bz2 compressed file

It is possible (though not recommended) to 1) decompress the dump file, 2) split the data into files, and 3) extract the text with a single command. You can automatically remove all the intermediate XML files with the -x option.

$ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text -x

Sample Output

Output containing title, category info, and paragraphs

$ wp2txt -i ./input -o ./output

Output containing title and category only

$ wp2txt -g -i ./input -o ./output

Output containing title, category, and summary

$ wp2txt -s -i ./input -o ./output

Command Line Options

Command line options are as follows:

Usage: wp2txt [options]
where [options] are:
  -i, --input                      Path to compressed file (bz2) or decompressed file (xml), or path to directory containing files of the latter format
  -o, --output-dir=<s>             Path to output directory
  -c, --convert, --no-convert      Output in plain text (converting from XML) (default: true)
  -a, --category, --no-category    Show article category information (default: true)
  -g, --category-only              Extract only article title and categories
  -s, --summary-only               Extract only article title, categories, and summary text before first heading
  -f, --file-size=<i>              Approximate size (in MB) of each output file (default: 10)
  -n, --num-procs                  Number of processes (up to 8) to be run concurrently (default: max number of available CPU cores minus two)
  -x, --del-interfile              Delete intermediate XML files from output dir
  -t, --title, --no-title          Keep page titles in output (default: true)
  -d, --heading, --no-heading      Keep section titles in output (default: true)
  -l, --list                       Keep unprocessed list items in output
  -r, --ref                        Keep reference notations in the format [ref]...[/ref]
  -e, --redirect                   Show redirect destination
  -m, --marker, --no-marker        Show symbols prefixed to list items, definitions, etc. (default: true)
  -b, --bz2-gem                    Use Ruby's bzip2-ruby gem instead of a system command
  -v, --version                    Print version and exit
  -h, --help                       Show this message

Caveats

  • Some data, such as mathematical formulas and computer source code, will not be converted correctly.
  • Some text data may not be extracted correctly for various reasons (incorrect matching of begin/end tags, language-specific formatting rules, etc.).
  • The conversion process can take longer than expected. When dealing with a huge data set such as the English Wikipedia on a low-spec environment, it can take several hours or more.

References

The author will appreciate your mentioning this work in your research. You can use the following BibTeX entry:

@misc{wp2txt_2023,
  author = {Yoichiro Hasebe},
  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
  url = {https://github.com/yohasebe/wp2txt},
  year = {2023}
}

License

This software is distributed under the MIT License. Please see the LICENSE file.

More Repositories

1

openai-chat-api-workflow

🎩 An Alfred 5 Workflow for using OpenAI Chat API to interact with GPT-3.5/GPT-4 🤖💬 It also allows image generation 🖼️, image understanding 👀, speech-to-text conversion 🎤, and text-to-speech synthesis 🔈
293
star
2

engtagger

English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
Ruby
258
star
3

lemmatizer

Lemmatizer for text in English. Inspired by Python's nltk.corpus.reader.wordnet.morphy
Ruby
109
star
4

rsyntaxtree

Syntax tree generator for linguistic research
Ruby
96
star
5

whisper-stream

A bash script using OpenAI Whisper API for continuous audio transcription with automatic silence detection
Shell
71
star
6

ruby-spacy

A wrapper module for using spaCy natural language processing library from the Ruby programming language via PyCall
Ruby
59
star
7

fzf-alfred-workflow

An Alfred workflow for fuzzy-finding files/directories using fzf and fd.
53
star
8

deepl-alfred-translate-rewrite-workflow

An Alfred workflow to help translate and rewrite text using DeepL API
31
star
9

monadic-chat

๐Ÿค– + ๐Ÿณ + ๐Ÿง Monadic Chat is a framework designed to create and use intelligent chatbots. By providing a full-fledged Linux environment on Docker to GPT-4 and other LLMs, it allows the chatbots to perform advanced tasks that require external tools for searching, coding, testing, analysis, visualization, and more.
Ruby
20
star
10

fastmail-plus

A Chrome extension to make Fastmail web UI more usable and productive
JavaScript
20
star
11

vim-command-workflow

An Alfred workflow to search Vim command cheat sheet + type commands
Ruby
18
star
12

rginger

RGinger takes an English sentence and gives correction and rephrasing suggestions for it using Ginger proofreading API.
Ruby
17
star
13

monadic-chat-cli

Highly configurable CLI app for OpenAI's chat/text completion API
Ruby
11
star
14

rubyfca

Command line tool for Formal Concept Analysis written in Ruby
Ruby
7
star
15

finder-unclutter

An Alfred 🎩 workflow that removes duplicate Finder tabs and windows and arranges them into a single or dual-pane 👓 layout for a cleaner desktop experience 🖥️ 🧹
5
star
16

code-packager

📦 A bash script that packages your codebase into a single JSON file, ready to be analyzed and understood by large language models (LLMs) like GPT-4, Claude, Command R, and Gemini 🤖
Shell
5
star
17

rubyplb

Command line Pattern Lattice building tool written in Ruby
Ruby
4
star
18

paradocs

Paradocs: A Paragraph-Oriented Text Document Presentation System
4
star
19

objective-wordnet

3
star
20

mac-dictionary-selector

An Alfred3 Workflow that lets you quickly look up words from a variety of dictionaries preinstalled in OSX
Ruby
3
star
21

ruby-wordle

A set of ruby scripts to generate word-lists, solve Wordle and play Wordle
Ruby
2
star
22

quickanswers

QuickAnswers
JavaScript
1
star
23

rsyntaxtree_web

JavaScript
1
star
24

speak_slow

SpeakSlow modifies audio files adding pauses and/or altering speed to suit for language study
Ruby
1
star