  • Stars: 739
  • Rank: 59,264 (Top 2%)
  • Language: JavaScript
  • License: MIT License
  • Created over 9 years ago
  • Updated 5 months ago

Repository Details

a pretty-committed wikipedia markup parser
wtf_wikipedia
parse data from wikipedia
npm install wtf_wikipedia

it is very, very hard. we're not joking.
why do we always do this?
we put our information where we can't take it out.

import wtf from 'wtf_wikipedia'

let doc = await wtf.fetch('Toronto Raptors')
let coach = doc.infobox().get('coach')
coach.text() //'Nick Nurse'

.text()

get clean plaintext:

let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
wtf(str).text()
// "Boston's baseball field has a 37ft wall."
let doc = await wtf.fetch('Glastonbury', 'en')
doc.sentences()[0].text()
// 'Glastonbury is a town and civil parish in Somerset, England, situated at a dry point ...'

.json()

get all the data from a page:

let doc = await wtf.fetch('Whistling')

doc.json()
// { categories: ['Oral communication', 'Vocal skills'], sections: [{ title: 'Techniques' }], ...}

the default .json() output is really verbose, but you can cherry-pick data by poking-around like this:

// get just the links:
doc.links().map((link) => link.json())
//[{ page: 'Theatrical superstitions', text: 'supersitions' }]

// just the images:
doc.images()[0].json()
// { file: 'Image:Duveneck Whistling Boy.jpg', url: 'https://commons.wiki...' }

// json for a particular section:
doc.section('see also').links()[0].json()
// { page: 'Slide Whistle' }

run it on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  wtf.fetch('Radiohead', {'Api-User-Agent': 'Name your script here'}, function (err, doc) {
    let members = doc.infobox().get('current members')
    members.links().map((l) => l.page())
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  })
</script>

or the server-side:

import wtf from 'wtf_wikipedia'
// or,
const wtf = require('wtf_wikipedia')

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole english wikipedia in an afternoon.

npm install -g dumpster-dive

Ok first, 🛀

Wikitext is no small thing.

Consider:

this library supports many recursive shenanigans, deprecated and obscure template variants, and illicit wiki-shorthands.

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parse infoboxes into a formatted key-value object
  • Handles recursive templates and links- like [[.. [[...]] ]]
  • Per-sentence plaintext and link resolution
  • Parse and format internal links
  • creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolve dynamic templates like {{CURRENTMONTH}} and {{CONVERT ..}}
  • Parse images, headings, and categories
  • converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • parse and combine citation and reference metadata
  • Eliminate xml, latex, css, and table-sorting cruft
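
To make the DMS bullet above concrete, here is a minimal sketch of turning one degrees/minutes/seconds coordinate into a decimal number. It is not the library's own code: `dmsToDecimal` is a hypothetical helper, and it only understands the simple shape shown in the example.

```javascript
// Hypothetical helper, not the library's implementation.
// Converts one DMS coordinate such as 59°12'7.7"N to decimal degrees.
function dmsToDecimal(dms) {
  const reg = /^(\d+(?:\.\d+)?)°\s*(?:(\d+(?:\.\d+)?)['′]\s*)?(?:(\d+(?:\.\d+)?)["″]\s*)?([NSEW])$/i
  const m = dms.trim().match(reg)
  if (!m) {
    return null // not a shape this sketch understands
  }
  const degrees = parseFloat(m[1])
  const minutes = parseFloat(m[2] || '0')
  const seconds = parseFloat(m[3] || '0')
  const decimal = degrees + minutes / 60 + seconds / 3600
  // south and west are negative by convention
  return /[SW]/i.test(m[4]) ? -decimal : decimal
}

console.log(dmsToDecimal(`59°12'7.7"N`).toFixed(4)) // '59.2021'
```

real wikitext usually carries coordinates inside a template, with many more variants than this handles.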

What it doesn't do:

  • external 'transcluded' page data
  • AST output
  • smart (or 'pretty') formatting of html in infoboxes or galleries
  • maintain perfect page order
  • per-sentence references (by 'section' element instead)
  • maintain template or infobox css styling
  • large tables that span different sections

It is built to be as flexible as possible. In all cases, it tries to fail in considerate ways.

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

if you prefer this screen-scraping workflow, you can pluck at parts of a page like that.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library is grateful to the Parsoid contributors.

okay,

flip your wikitext into a Doc object

import wtf from 'wtf_wikipedia'

let txt = `
==Wood in Popular Culture==
* Harry Potter's wand
* The Simpson's fence
`
wtf(txt)
// Document {text(), json(), lists()...}

doc.links()

let txt = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
wtf(txt).links().map((l) => l.page())
// [ 'Lassie (1954 TV series)',  'The X-Files' ]
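
for a feel of what link parsing involves, here is a small self-contained sketch that pulls `[[Page|text]]` links out of a string with a regular expression. This is not how wtf_wikipedia works internally (the library also deals with nested links, files, and interwiki prefixes). `extractLinks` is a hypothetical helper, for illustration only.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Handles only simple, non-nested [[Page]] and [[Page|text]] links.
function extractLinks(wikitext) {
  const linkReg = /\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]/g
  const links = []
  for (const m of wikitext.matchAll(linkReg)) {
    const page = m[1].trim()
    // when there is no '|label', the link text is the page title itself
    const text = m[2] === undefined ? page : m[2].trim()
    links.push({ page, text })
  }
  return links
}

const sample = "such as [[Lassie (1954 TV series)|''Lassie'']], and ''[[The X-Files]]''."
console.log(extractLinks(sample).map((l) => l.page))
// [ 'Lassie (1954 TV series)', 'The X-Files' ]
```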

doc.text()

returns nice plain-text of the article

let txt = "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
wtf(txt).text()
//"Boston's baseball field has a 37ft wall."

doc.sections():

a section is a heading '==Like This=='

wtf(page).sections()[1].children() //traverse nested sections
wtf(page).section('see also').remove() //delete one
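
as a rough illustration of the heading syntax, the sketch below reads a single heading line and reports its title and how deeply it is nested. `parseHeading` is a hypothetical helper, not a library method, and counting '==' as depth 0 is a choice made for this example.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Reads one heading line like '===1920s===' into a title and nesting depth.
function parseHeading(line) {
  // the same run of '=' must open and close the heading
  const m = line.trim().match(/^(={2,6})\s*(.+?)\s*\1$/)
  if (!m) {
    return null // not a heading
  }
  return { title: m[2], depth: m[1].length - 2 } // '==' is depth 0 here
}

console.log(parseHeading('==Like This=='))
// { title: 'Like This', depth: 0 }
console.log(parseHeading('===1920s==='))
// { title: '1920s', depth: 1 }
```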

doc.sentences()

let s = wtf(page).sentences()[4]
s.links()
s.bolds()
s.italics()

doc.categories()

(await wtf.fetch('Whistling')).categories()
//['Oral communication', 'Vocal music', 'Vocal skills']

doc.images()

let img = wtf(page).images()[0]
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..

Fetch

You can grab and parse articles from any wiki api. This includes any language, any wiki-project, and most 3rd-party wikis.

// 3rd-party wiki
let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')

// wikipedia français
doc = await wtf.fetch('Tony Hawk', 'fr')
doc.sentence().text() // 'Tony Hawk est un skateboarder professionnel et un acteur ...'

// accept an array, or wikimedia pageIDs
let docs = await wtf.fetch(['Whistling', 2983], { follow_redirects: false })

// article from german wikivoyage
wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
  console.log(doc.sentences()[0].text()) // 'Toronto ist die Hauptstadt der Provinz Ontario'
})

you may also pass the wikipedia page id as a parameter, instead of the page title:

let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.

API plugin

wtf.getCategoryPages(title, [options])

retrieves all pages and sub-categories belonging to a given category:

wtf.extend(require('wtf-plugin-api'))
let result = await wtf.getCategoryPages('Category:Politicians_from_Paris')
/*
{
  [
    {"pageid":52502362,"ns":0,"title":"William Abitbol"},
    {"pageid":50101413,"ns":0,"title":"Marie-Joseph Charles des Acres de L'Aigle"}
    ...
    {"pageid":62721979,"ns":14,"title":"Category:Councillors of Paris"},
    {"pageid":856891,"ns":14,"title":"Category:Mayors of Paris"}
  ]
}
*/

wtf.random([options])

fetches a random wikipedia article, from a given language or domain

wtf.extend(require('wtf-plugin-api'))
wtf.random().then((doc) => {
  console.log(doc.title(), doc.categories())
  //'Whistling'  ['Oral communication', 'Vocal skills']
})

see wtf-plugin-api

Tutorials

Plugins

these add all sorts of new functionality:

wtf.extend(require('wtf-plugin-classify'))
;(await wtf.fetch('Toronto Raptors')).classify()
// 'Organization/SportsTeam'

wtf.extend(require('wtf-plugin-summary'))
;(await wtf.fetch('Pulp Fiction')).summary()
// 'a 1994 American crime film'

wtf.extend(require('wtf-plugin-person'))
;(await wtf.fetch('David Bowie')).birthDate()
// {year:1947, date:8, month:1}

wtf.extend(require('wtf-plugin-i18n'))
;(await wtf.fetch('Ziggy Stardust', 'fr')).infobox().json()
// {nom:{text:"Ziggy Stardust"}, oeuvre:{text:"The Rise and Fall of Ziggy Stardust"}}

  • classify - person/place/thing
  • summary - short description text
  • person - birth/death information
  • api - fetch more data from the API
  • i18n - improves multilingual template coverage
  • wtf-mlb - fetch baseball data
  • wtf-nhl - fetch hockey data
  • nsfw - flag sexual/graphic/adult articles
  • image - additional methods for .images()
  • html - output html
  • wikitext - output wikitext
  • markdown - output markdown
  • latex - output latex

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

  • pass an Api-User-Agent, so they can easily throttle bad scripts
  • bundle multiple pages into one request, as an array (say, groups of 5?)
  • run it serially, or at least, slowly.

wtf
  .fetch(['Royal Cinema', 'Aldous Huxley'], {
    lang: 'en',
    'Api-User-Agent': '[email protected]',
  })
  .then((docList) => {
    let links = docList.map((doc) => doc.links())
    console.log(links)
  })
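
the second and third recommendations can be combined into a small loop. The sketch below is one possible way to do it, not something the library ships: `chunk`, `sleep` and `fetchInBatches` are hypothetical helpers, the 500ms pause is an arbitrary default, and the actual fetching is passed in as a function so the loop stays independent of the network.

```javascript
// Hypothetical helpers, not part of wtf_wikipedia.
// Split a list of titles into groups of `size`.
function chunk(list, size) {
  const groups = []
  for (let i = 0; i < list.length; i += size) {
    groups.push(list.slice(i, i + size))
  }
  return groups
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Fetch the groups one after another, pausing between requests.
// `fetchGroup` is injected. In real use it could be something like:
//   (titles) => wtf.fetch(titles, { 'Api-User-Agent': 'Name your script here' })
async function fetchInBatches(titles, fetchGroup, size = 5, pauseMs = 500) {
  const results = []
  for (const group of chunk(titles, size)) {
    const docs = await fetchGroup(group) // one request per group
    results.push(...[].concat(docs)) // tolerate a single result or an array
    await sleep(pauseMs) // be polite between requests
  }
  return results
}
```

because each group is awaited before the next one starts, only one request is in flight at a time.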

Full API

  • .title() - get/set the title of the page from the first-sentence
  • .pageID() - get/set the wikimedia id of the page, if we have it.
  • .wikidata() - get/set the wikidata id of the page, if we have it.
  • .domain() - get/set the domain of the wiki we're on, if we have it.
  • .url() - (try to) generate the url for the current article
  • .lang() - get/set the current language (used for url method)
  • .namespace() - get/set the wikimedia namespace of the page, if we have it
  • .isRedirect() - if the page is just a redirect to another page
  • .redirectTo() - the page this redirects to
  • .isDisambiguation() - is this a placeholder page to direct you to one-of-many possible pages
  • .categories() - return all categories of the document
  • .sections() - return a list of the Document's sections
  • .paragraphs() - return a list of Paragraphs, in all sections
  • .sentences() - return a list of all sentences in the document
  • .images() - return all images found in the document
  • .links() - return a list of all links, in all parts of the document
  • .lists() - sections in a page where each line begins with a bullet point
  • .tables() - return a list of all structured tables in the document
  • .templates() - any type of structured-data elements, typically wrapped in like {{this}}
  • .infoboxes() - specific type of template, that appear on the top-right of the page
  • .references() - return a list of 'citations' in the document
  • .coordinates() - geo-locations that appear on the page
  • .text() - plaintext, human-readable output for the page
  • .json() - a 'stringifyable' output of the page's main data
  • .wikitext() - original wiki markup

Section

  • .title() - the name of the section, between ==these tags==
  • .index() - which number section is this, in the whole document.
  • .indentation() - how many steps deep into the table of contents it is
  • .sentences() - return a list of sentences in this section
  • .paragraphs() - return a list of paragraphs in this section
  • .links() - list of all links, in all paragraphs and templates
  • .tables() - list of all html tables
  • .templates() - list of all templates in this section
  • .infoboxes() - list of all infoboxes found in this section
  • .coordinates() - list of all coordinate templates found in this section
  • .lists() - list of all lists in this section
  • .interwiki() - any links to other language wikis
  • .images() - return a list of any images in this section
  • .references() - return a list of 'citations' in this section
  • .remove() - remove the current section from the document
  • .nextSibling() - a section following this one, under the current parent: eg. 1920s → 1930s
  • .lastSibling() - a section before this one, under the current parent: eg. 1930s → 1920s
  • .children() - any sections more specific than this one: eg. History → [PreHistory, 1920s, 1930s]
  • .parent() - the section, broader than this one: eg. 1920s β†’ History
  • .text() - readable plaintext for this section
  • .json() - return all section data
  • .wikitext() - original wiki markup

Paragraph

  • .sentences() - return a list of sentence objects in this paragraph
  • .references() - any citations, or references in all sentences
  • .lists() - any lists found in this paragraph
  • .images() - any images found in this paragraph
  • .links() - list of all links in all sentences
  • .interwiki() - any links to other language wikis
  • .text() - generate readable plaintext for this paragraph
  • .json() - generate some generic data for this paragraph in JSON format
  • .wikitext() - original wiki markup

Sentence

  • .links() - list of all links
  • .bolds() - list of all bold texts
  • .italics() - list of all italic formatted text
  • .json() - return all sentence data
  • .wikitext() - original wiki markup

Image

  • .url() - return url to full size image
  • .thumbnail() - return url to thumbnail (pass size to customize)
  • .links() - any links from the caption (if present)
  • .format() - get file format (e.g. jpg)
  • .json() - return some generic metadata for this image
  • .text() - does nothing
  • .wikitext() - original wiki markup

Template

  • .text() - does this template generate any readable plaintext?
  • .json() - get all the data for this template
  • .wikitext() - original wiki markup

Infobox

  • .links() - any internal or external links in this infobox
  • .keyValue() - generate simple key:value strings from this infobox
  • .image() - grab the main image from this infobox
  • .get() - lookup properties from their key
  • .template() - which infobox, eg 'Infobox Person'
  • .text() - generate readable plaintext for this infobox
  • .json() - generate some generic 'stringifyable' data for this infobox
  • .wikitext() - original wiki markup

List

  • .lines() - get an array of each member of the list
  • .links() - get all links mentioned in this list
  • .text() - generate readable plaintext for this list
  • .json() - generate some generic easily-parsable data for this list
  • .wikitext() - original wiki markup

Reference

  • .title() - generate human-facing text for this reference
  • .links() - get any links mentioned in this reference
  • .text() - returns nothing
  • .json() - generate some generic metadata for this reference
  • .wikitext() - original wiki markup

Table

  • .links() - get any links mentioned in this table
  • .keyValue() - generate a simple list of key:value objects for this table
  • .text() - returns nothing
  • .json() - generate some useful metadata for this table
  • .wikitext() - original wiki markup

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

wtf.extend((models) => {
  // throw this method in there...
  models.Doc.prototype.isPerson = function () {
    return this.categories().find((cat) => cat.match(/people/))
  }
})

;(await wtf.fetch('Stephen Harper')).isPerson()

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

wtf.extend((models, templates) => {
  // create a custom parser function
  templates.foo = (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return 'new-text'
  }

  // array-syntax allows easy-labeling of parameters
  templates.bar = ['a', 'b', 'c']

  // number-syntax for returning by param # '{{name|zero|one|two}}'
  templates.baz = 0

  // replace the template with a string '{{asterisk}}' -> '*'
  templates.asterisk = '*'
})
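
to show what the array-syntax is doing, here is a rough, self-contained sketch of labelling a template's positional parameters. It is not the library's parser: `parseTemplate` and the `{{demo}}` template are hypothetical, and it only copes with flat templates (no nesting, no links containing '|').

```javascript
// Hypothetical sketch, not the library's parser.
// Handles only flat templates: no nesting, no links with '|' inside.
function parseTemplate(wikitext, labels = []) {
  const inner = wikitext.trim().replace(/^\{\{/, '').replace(/\}\}$/, '')
  const [name, ...params] = inner.split('|').map((s) => s.trim())
  const obj = { template: name.toLowerCase() }
  let position = 0
  for (const param of params) {
    const eq = param.indexOf('=')
    if (eq !== -1) {
      // named parameter, like 'key=value'
      obj[param.slice(0, eq).trim()] = param.slice(eq + 1).trim()
    } else {
      // positional parameter: use the label for this slot, else its index
      const key = labels[position] !== undefined ? labels[position] : String(position)
      obj[key] = param
      position += 1
    }
  }
  return obj
}

console.log(parseTemplate('{{demo|one|two|c=three}}', ['a', 'b', 'c']))
// { template: 'demo', a: 'one', b: 'two', c: 'three' }
```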

by default, if there's no parser for a template, it will be just ignored and generate an empty string. However, it's possible to configure a fallback parser function to handle these templates:

wtf('some {{weird_template}} here', { 
  templateFallbackFn: (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return '[unsupported template]' // or return null to ignore this template
  }
})

you can determine which templates are understood to be 'infoboxes' with the 3rd parameter:

wtf.extend((models, templates, infoboxes) => {
  Object.assign(infoboxes, { person: true, place: true, thing: true })
})

Notes:

3rd-party wikis

by default, a public API is provided by an installed mediawiki application. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
  console.log(doc.text())
})

some wikis will change the path of their API, from ./api.php to elsewhere. If your api has a different path, you can set it like so:

wtf.fetch('2016-06-04_-_J.Fernandes_@_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
  console.log(doc.template('player').json())
})
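
for orientation, this is roughly what a request for a page's wikitext looks like once a domain and path are known. The sketch only builds the url, using standard MediaWiki query parameters. `buildApiUrl` is a hypothetical helper, and the exact parameters the library sends may differ.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Builds a MediaWiki API url that asks for a page's raw wikitext.
function buildApiUrl(title, { domain, path = 'api.php' } = {}) {
  const url = new URL(`https://${domain}/${path}`)
  url.search = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json',
  }).toString()
  return url.href
}

console.log(buildApiUrl('Kermit', { domain: 'muppet.fandom.com' }))
console.log(buildApiUrl('Kermit', { domain: 'www.mixesdb.com', path: 'db/api.php' }))
```

opening the printed url in a browser is a quick way to check whether a wiki's api is reachable at that path.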

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis, (like wikia) have intentionally disabled this.
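
a sketch of why Special:Redirect matters: it lets a url be built from nothing but the filename. The helper below is an assumption about how such a url can be put together, not the library's code. `imageUrl` is hypothetical, and the `/wiki/` article path is an assumption that may not hold on every 3rd-party wiki.

```javascript
// Hypothetical helper, not the library's implementation.
// Builds an image url through MediaWiki's Special:Redirect/file page.
function imageUrl(file, { domain = 'en.wikipedia.org', width } = {}) {
  // drop a leading 'File:' or 'Image:' namespace, and use underscores for spaces
  const name = file.replace(/^(file|image):/i, '').trim().replace(/ /g, '_')
  let url = `https://${domain}/wiki/Special:Redirect/file/${encodeURIComponent(name)}`
  if (width) {
    url += `?width=${width}` // ask the wiki for a scaled-down thumbnail
  }
  return url
}

console.log(imageUrl('Image:Duveneck Whistling Boy.jpg', { width: 300 }))
// https://en.wikipedia.org/wiki/Special:Redirect/file/Duveneck_Whistling_Boy.jpg?width=300
```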

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia I18n language information for Redirects, Infoboxes, Categories, and Images is included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

Builds:

this library ships separate client-side and server-side builds, to keep filesize down.

the browser version uses fetch() and the server version uses require('https').

Performance:

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse a full english wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.
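
a back-of-envelope check of how those two numbers fit together. The page count and thread count below are assumptions picked for illustration, not measurements from this project.

```javascript
// Back-of-envelope estimate; the inputs below are assumptions, not measurements.
function estimateHours(pageCount, pagesPerSecond, threads) {
  const seconds = pageCount / (pagesPerSecond * threads)
  return seconds / 3600
}

// assumed: 6 million pages, 4 worker threads, 100 pages/second each
console.log(estimateHours(6000000, 100, 4).toFixed(1)) // '4.2'
```

on a single thread, the same assumed page count would take roughly four times as long.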

See also:

Other alternative javascript parsers:

and many more!

MIT
