  • Stars
    290
  • Rank 142,042 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 8 years ago
  • Updated almost 3 years ago

Repository Details

Import tables from any Wikipedia article as a dataset in Python

wikitables

Installing

pip install wikitables

Usage

Importing

Importing all tables from a given article:

from wikitables import import_tables
tables = import_tables('List of cities in Italy')  # returns a list of WikiTable objects

To import an article from a different language edition of Wikipedia, pass the Wikipedia language code as the second argument to import_tables. Country names will then also be shown in that language.

tables = import_tables('İtalya\'daki_şehirler_listesi', 'tr')  # returns a list of WikiTable objects

Accessing

Iterate over a table's rows:

print(tables[0].name)
for row in tables[0].rows:
    print('{City}: {Area(km2)}'.format(**row))

output:

List of cities in Italy[0]
Milan: 4,450.11
Naples: 3,116.52
Rome: 3,340.41
Turin: 1,328.40
...
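
The format(**row) call above works because a row can be unpacked like a dictionary keyed by column name. Below is a minimal sketch of that mechanism. It uses a plain dict as a stand-in for a row (an assumption for illustration; real rows come from import_tables), and describe is a hypothetical helper. One thing to watch for: a column name containing a dot, such as Density(inh./km2), cannot be used as a format field, because str.format reads the dot as attribute access.

```python
# A plain dict standing in for one table row (hypothetical stand-in;
# values copied from the sample output above).
row = {
    "City": "Milan",
    "Area(km2)": "4,450.11",
    "Density(inh./km2)": "1,488",
}


def describe(row):
    # Keys without '.', '[' or ']' can be used directly as format fields.
    summary = "{City}: {Area(km2)}".format(**row)
    # "Density(inh./km2)" contains a '.', which str.format would parse as
    # attribute access, so index the row explicitly instead.
    density = row["Density(inh./km2)"]
    return "{} ({} inh./km2)".format(summary, density)


print(describe(row))
```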

Or return the table encoded as JSON:

tables[0].json()

output:

[
    {
        "City": "Milan",
        "Population January 1, 2014": "6,623,798",
        "Density(inh./km2)": "1,488",
        "Area(km2)": "4,450.11"
    },
    {
        "City": "Naples",
        "Population January 1, 2014": "5,294,546",
        "Density(inh./km2)": "1,699",
        "Area(km2)": "3,116.52"
    },
    ...
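
In the sample above every value is a string, including numbers with thousands separators. If you need numeric values, one option is to post-process the decoded JSON. The following is a minimal sketch, assuming json() returns a JSON string shaped like the output shown; payload is a hand-written stand-in for that string and to_number is a hypothetical helper, not part of wikitables.

```python
import json

# Hand-written stand-in for the string returned by tables[0].json(),
# using the two records from the sample output above.
payload = """
[
    {"City": "Milan", "Population January 1, 2014": "6,623,798",
     "Density(inh./km2)": "1,488", "Area(km2)": "4,450.11"},
    {"City": "Naples", "Population January 1, 2014": "5,294,546",
     "Density(inh./km2)": "1,699", "Area(km2)": "3,116.52"}
]
"""


def to_number(text):
    """Convert '6,623,798' or '4,450.11' to int or float; return other text unchanged."""
    cleaned = text.replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return text


records = [
    {key: to_number(value) for key, value in record.items()}
    for record in json.loads(payload)
]
print(records[0]["Area(km2)"], records[1]["Population January 1, 2014"])
```

Note that float() also accepts strings such as "nan" and "inf", so a text cell with one of those values would be converted too; a stricter check may be needed for real data.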

Table Head

After import, a table's column names can be changed by assigning a new header:

table.head = [ 'newfield1', 'newfield2', 'newfield3' ]

The change is applied to every row of the table.
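
Conceptually, a new header re-keys each row by column position. The sketch below illustrates that idea on plain dicts. rename_columns is a hypothetical helper written for this example; it is not the library's implementation, and it returns new rows rather than modifying a table in place.

```python
def rename_columns(rows, old_head, new_head):
    """Return new row dicts with keys replaced positionally (old_head[i] -> new_head[i])."""
    if len(old_head) != len(new_head):
        raise ValueError("new header must have one name per column")
    mapping = dict(zip(old_head, new_head))
    return [{mapping[key]: value for key, value in row.items()} for row in rows]


# Plain dicts standing in for table rows (values from the sample output).
rows = [
    {"City": "Milan", "Area(km2)": "4,450.11"},
    {"City": "Naples", "Area(km2)": "3,116.52"},
]
renamed = rename_columns(rows, ["City", "Area(km2)"], ["city", "area_km2"])
print(renamed[0])
```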

Commandline

wikitables also comes with a simple CLI tool that fetches tables and outputs them as JSON:

# from article name
wikitables 'List of cities in Italy'

# from URL
wikitables https://en.wikipedia.org/wiki/Radio_spectrum#ITU
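
The CLI accepts a URL, while import_tables takes an article title and an optional language code. If you start from a URL in Python, a minimal sketch for splitting it into those two pieces could look like the following. split_wikipedia_url is a hypothetical helper, not part of wikitables, and it assumes the usual https://<lang>.wikipedia.org/wiki/<Title> layout; how the CLI itself handles URLs is not covered here.

```python
from urllib.parse import unquote, urlparse


def split_wikipedia_url(url):
    """Return (language_code, article_title) for a /wiki/ style Wikipedia URL."""
    parsed = urlparse(url)
    # The language edition is the first label of the host name, e.g. "en".
    lang = parsed.hostname.split(".")[0]
    # Everything after "/wiki/" is the title slug; the "#..." fragment is
    # already separated out by urlparse.
    slug = parsed.path.partition("/wiki/")[2]
    title = unquote(slug).replace("_", " ")
    return lang, title


print(split_wikipedia_url("https://en.wikipedia.org/wiki/Radio_spectrum#ITU"))
```

The resulting pair can then be passed on, for example as import_tables(title, lang).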

Creating a list of DataFrames

from wikitables import import_tables
import pandas as pd


def get_df_from_table_object(table):
    # Each row maps column names to values, so the rows load directly into a DataFrame
    rows = list(table.rows)
    return pd.DataFrame(rows)


def get_list_of_df_of_wiki_article(wiki_title):
    tables = import_tables(wiki_title)
    return [get_df_from_table_object(table) for table in tables]


print(get_list_of_df_of_wiki_article(wiki_title='List of cities in Italy'))

output:

[    Rank         City 2011    Census 2020    Estimate                Change    Region
0      1         Rome        2617175          2856133     9.130379130168986     Lazio
1      2        Milan        1242123          1378689     10.99456334034552  Lombardy
2      3       Naples         962003           959188   -0.2926186300874267  Campania
3      4        Turin         872367           875698   0.38183470947434905  Piedmont
4      5      Palermo         657651           663401    0.8743239195257102    Sicily
..   ...          ...            ...              ...                   ...       ...
139  140  Battipaglia          51133            51005  -0.25032757710284903  Campania
140  141          Rho          50686            50904    0.4300990411553407  Lombardy
141  142       Chieti          54305            50770    -6.509529509253287   Abruzzo
142  143      Scafati          50794            50686   -0.2126235382131747  Campania
143  144    Scandicci          50309            50645    0.6678725476554792   Tuscany

[144 rows x 6 columns]]
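
The Change column in this output appears to be the percentage change from the 2011 census figure to the 2020 estimate. That is an inference from the numbers shown, not something stated in the README. A quick arithmetic check against the first two rows, with percent_change as a hypothetical helper:

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Population figures copied from the Rome and Milan rows above.
rome = percent_change(2617175, 2856133)
milan = percent_change(1242123, 1378689)
print(rome, milan)
```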

Roadmap

Some planned and wishlist features:

  • Type guessing from MediaWiki template values
