  • Language
    JavaScript
  • License
    MIT License
  • Created over 8 years ago
  • Updated over 5 years ago

Wikidata-Taxonomy

Command-line tool and library to extract taxonomies from Wikidata.

Installation

wikidata-taxonomy requires at least Node.js version 6.

Install globally to make command wdtaxonomy accessible from your shell $PATH:

$ npm install -g wikidata-taxonomy

Installation and usage as a module and in web applications are described below.

Usage

This module provides the command wdtaxonomy. When called without arguments, it prints usage help:

$ wdtaxonomy


  Usage: wdtaxonomy [options] <id>

  extract taxonomies from Wikidata


  Options:

    -V, --version                           output the version number
    -b, --brief                             omit counting instances and sites
    -c, --children                          get direct subclasses only
    -C, --color                             enforce color output
    -d, --descr                             include item descriptions
    -e, --sparql-endpoint <url>             customize the SPARQL endpoint
    -f, --format <txt|csv|tsv|json|ndjson>  output format
    -i, --instances                         include instances
    -I, --no-instancecount                  omit counting instances
    -j, --json                              use JSON output format
    -l, --lang <lang>                       specify the language to use
    -L, --no-labels                         omit all labels
    -m, --mappings <ids>                    mapping properties (e.g. P1709)
    -n, --no-colors                         disable color output
    -o, --output <file>                     write result to a file
    -P, --property <id>                     hierarchy property (e.g. P279)
    -R, --prune <criteria>                  prune hierarchies (e.g. mappings)
    -p, --post                              use HTTP POST to disable caching
    -r, --reverse                           get superclasses instead
    -s, --sparql                            print SPARQL query and exit
    -S, --no-sitecount                      omit counting sites
    -t, --total                             count total number of instances
    -u, --user <name>                       user to the SPARQL endpoint
    -U, --uris                              show full URIs in output formats
    -v, --verbose                           make the output more verbose
    -w, --password <string>                 password to the SPARQL endpoint
    -h, --help                              output usage information

The first argument must be a Wikidata identifier to be used as the root of the taxonomy. For instance, to extract a taxonomy of planets (Q634):

$ wdtaxonomy Q634

To look up an identifier by label, use wikidata-cli (e.g. wd id planet or wd f planet).

The extracted taxonomy by default is based on statements using the property "subclass of" (P279) or "subproperty of" (P1647). Taxonomy extraction and output can be controlled by several options. Option --sparql (or -s) prints the underlying SPARQL queries instead of executing them.

Examples

Direct subclasses of planet (Q634) with description and mappings:

$ wdtaxonomy Q634 -c -d -m =

The hierarchy properties P279 ("subclass of") and P31 ("instance of") used to build taxonomies can be changed with option --property (-P).

Members of (P463) the European Union (Q458):

$ wdtaxonomy Q458 -P P463

Members of (P463) the European Union (Q458) and number of its citizens in Wikidata (P27):

$ wdtaxonomy Q458 -P 463/27

Wikiversity (Q370) editions mapped to their homepage URL (P856):

$ wdtaxonomy Q370 -i -m P856

Biological taxonomy of mammals (Q7377):

$ wdtaxonomy Q7377 -P P171 --brief

Property constraints (Q21502402) with number of properties that have each constraint:

$ wdtaxonomy Q21502402 -P 279,2302 

As Wikidata is not a strict ontology, subproperties are not factored in. For instance, the following query does not include members of the European Union, although P463 is a subproperty of P361.

Parts of (P361) the European Union (Q458):

$ wdtaxonomy Q458 -P P361

A taxonomy of subproperties can be queried like taxonomies of items. The hierarchy property is set to P1647 ("subproperty of") by default:

$ wdtaxonomy P361
$ wdtaxonomy P361 -P P1647  # equivalent

Subproperties of "part of" (P361) and which of them have an inverse property (P1696):

$ wdtaxonomy P361 -P P1647/P1696

Inverse properties are not factored in either, so the following two queries do not necessarily return the same results:

What hand (Q33767) is part of (P361):

$ wdtaxonomy Q33767 -P 361 -r

What parts the hand (Q33767) has (P527):

$ wdtaxonomy Q33767 -P 527

Options

Query options

brief (-b)

Don't count instances and sites. Same as -S/--no-sitecount and -I/--no-instancecount.

children (-c)

Get direct subclasses only

descr (-d)

Include item descriptions

sparql-endpoint (-e)

SPARQL endpoint to query (default: https://query.wikidata.org/sparql)

instances (-i)

Include instances

no-instancecount (-I)

Don't count number of instances

lang (-l)

Language to get labels in (default: en)

no-labels (-L)

Omit all labels. This allows for querying larger taxonomies (several thousand classes), especially if combined with option --brief.

mappings (-m)

Look up mappings based on given comma-separated properties such as P1709 (equivalent class). The following keywords can be used as shortcuts:

  • equal or =: equivalent property (P1628), equivalent class (P1709), and exact match (P2888)
  • broader: external superproperty (P2235)
  • narrower: narrower external class (P3950), external subproperty (P2236)
  • class: properties for mapping classes
  • property: properties for mapping properties
  • all: all properties for ontology mapping (instances of Q30249126)

reverse (-r)

Get superclasses instead of subclasses up to the root

no-sitecount (-S)

Don't count number of sites

total (-t)

Count total (transitive) number of instances, including instances of subclasses

post (-p)

Use HTTP POST to disable caching

sparql (-s)

Don't actually perform a query but print SPARQL query and exit

user (-u)

User to the SPARQL endpoint

password (-w)

Password to the SPARQL endpoint

Output options

color (-C)

Enforce color output where it is disabled by default (e.g. when output is piped or written to a file)

format (-f)

Output format

json (-j)

Use JSON output format. Same as --format json but shorter.

no-colors (-n)

Disable color output

output (-o)

Write result to a file given by name

prune (-R)

Prune the hierarchy to entries that meet any of the given criteria, plus their broader concepts and all top concepts:

  • mappings: has mappings
  • sites: has sites
  • instances: has instances
  • occurences: has sites or instances

Multiple criteria can be combined as alternatives, separated by commas.
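
The pruning rule can be illustrated on the JSKOS structure described under "JSON format" (fields topConcepts, concepts, narrower, mappings). The following is only a sketch of one reading of the rule for the criterion mappings, not the tool's actual implementation; the function name pruneByMappings and the sample data are made up for illustration.

```javascript
// Hypothetical helper (not part of wikidata-taxonomy): keeps concepts that
// have mappings, their ancestors within the hierarchy, and all top concepts.
function pruneByMappings (taxonomy) {
  // build a child -> parents index from the "narrower" links
  const parents = new Map()
  taxonomy.concepts.forEach(concept => {
    const children = concept.narrower || []
    children.forEach(child => {
      if (!parents.has(child.uri)) parents.set(child.uri, [])
      parents.get(child.uri).push(concept.uri)
    })
  })

  // top concepts are always kept
  const keep = new Set(taxonomy.topConcepts.map(top => top.uri))

  // keep a concept and walk upwards until an already kept concept is reached
  const addWithAncestors = uri => {
    if (keep.has(uri)) return
    keep.add(uri)
    const up = parents.get(uri) || []
    up.forEach(addWithAncestors)
  }

  taxonomy.concepts
    .filter(concept => (concept.mappings || []).length > 0)
    .forEach(concept => addWithAncestors(concept.uri))

  return taxonomy.concepts.filter(concept => keep.has(concept.uri))
}

// Made-up sample: A is the root, only D has a mapping
const sampleTaxonomy = {
  topConcepts: [{ uri: 'A' }],
  concepts: [
    { uri: 'A', narrower: [{ uri: 'B' }, { uri: 'C' }] },
    { uri: 'B', narrower: [{ uri: 'D' }] },
    { uri: 'C' },
    { uri: 'D', mappings: [{}] }
  ]
}

console.log(pruneByMappings(sampleTaxonomy).map(c => c.uri)) // [ 'A', 'B', 'D' ]
```

The sketch returns only the list of kept concepts; narrower links that point to removed concepts are left untouched.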

uris (-U)

Show full URIs in output formats, e.g. http://www.wikidata.org/entity/Q1 instead of Q1

verbose (-v)

Show verbose error messages

Output formats

Text format

By default, the taxonomy is printed in "text" format with colored Unicode characters:

$ wdtaxonomy Q17362350
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •25 ×4 ↑↑
└──inner planets (Q3504248) •8 ×4 ↑↑

The output contains item labels, Wikidata identifiers, the number of Wikimedia sites connected to each item (indicated by the bullet character "•"), the number of instances (property P31, indicated by a multiplication sign "×"), and an upwards arrow ("↑") as indicator for additional superclasses.

Option "--instances" (or "-i") explicitly includes instances:

$ wdtaxonomy -i Q17362350
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •25 ↑↑
|   -Saturn (Q193)
|   -Jupiter (Q319)
|   -Uranus (Q324)
|   -Neptune (Q332)
└──inner planets (Q3504248) •8 ↑↑
    -Earth (Q2)
    -Mars (Q111)
    -Mercury (Q308)
    -Venus (Q313)

Classes that occur at multiple places in the taxonomy (multihierarchy) are marked like in the following example:

$ wdtaxonomy Q634
planet (Q634) •202 ×7 ↑
├──extrasolar planet (Q44559) •88 ×833 ↑
|  ├──circumbinary planet (Q205901) •15 ×10
|  ├──super-Earth (Q327757) •32 ×46 ↑
...
├──terrestrial planet (Q128207) •70 ×7
|  ╞══super-Earth (Q327757) •32 ×46 ↑ …
...

JSON format

Option --format json serializes the taxonomy as a JSON object. The format follows the specification of JSKOS Concept Schemes:

{
  "type": [ "http://www.w3.org/2004/02/skos/core#ConceptScheme" ],
  "modified": "2017-11-06T10:25:54.966Z",
  "license": [
    {
      "uri": "http://creativecommons.org/publicdomain/zero/1.0/",
      "notation": [ "CC0" ]
    }
  ],
  "languages": [ "en" ],
  "topConcepts": [
    { "uri": "http://www.wikidata.org/entity/Q17362350" }
  ],
  "concepts": [ ]
}

Field concepts contains an array of all extracted Wikidata entities (usually classes and instances) as JSKOS Concepts:

{
  "uri": "http://www.wikidata.org/entity/Q17362350",
  "notation": [ "Q17362350" ],
  "prefLabel": {
    "en": "planet of the Solar System"
  },
  "scopeNote": {
    "en": [ "inner and outer planets of our solar system" ]
  },
  "broader": [
    { "uri": "http://www.wikidata.org/entity/Q634" }
  ],
  "narrower": [
    { "uri": "http://www.wikidata.org/entity/Q30014" },
    { "uri": "http://www.wikidata.org/entity/Q3504248" }
  ]
}
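
The narrower links can be followed from topConcepts to rebuild the hierarchy. Below is a minimal sketch, assuming a taxonomy object shaped like the JSON above; the helper name renderTree is hypothetical and not part of the library, and the sample data is abbreviated from the examples in this section.

```javascript
// Hypothetical helper: renders a JSKOS concept scheme as indented lines
// by following "narrower" links, starting at the top concepts.
function renderTree (taxonomy) {
  // index concepts by URI for quick lookup
  const byUri = new Map(taxonomy.concepts.map(c => [c.uri, c]))
  const lines = []

  const visit = (uri, depth, path) => {
    const concept = byUri.get(uri)
    // skip URIs without a concept record and guard against cycles
    if (!concept || path.has(uri)) return
    const label = (concept.prefLabel || {}).en || '???'
    lines.push('  '.repeat(depth) + label + ' (' + concept.notation[0] + ')')
    const nextPath = new Set(path).add(uri)
    const children = concept.narrower || []
    children.forEach(child => visit(child.uri, depth + 1, nextPath))
  }

  taxonomy.topConcepts.forEach(top => visit(top.uri, 0, new Set()))
  return lines
}

const entity = 'http://www.wikidata.org/entity/'
const sampleScheme = {
  topConcepts: [{ uri: entity + 'Q17362350' }],
  concepts: [
    {
      uri: entity + 'Q17362350',
      notation: ['Q17362350'],
      prefLabel: { en: 'planet of the Solar System' },
      narrower: [{ uri: entity + 'Q30014' }, { uri: entity + 'Q3504248' }]
    },
    { uri: entity + 'Q30014', notation: ['Q30014'], prefLabel: { en: 'outer planet' } },
    { uri: entity + 'Q3504248', notation: ['Q3504248'], prefLabel: { en: 'inner planets' } }
  ]
}

console.log(renderTree(sampleScheme).join('\n'))
```

A concept that occurs at several places in a multihierarchy is printed once per place, since the sketch only guards against cycles along a single path.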

Instances (option --instances) are linked via field subjectOf in the same way as fields broader and narrower.

The numbers of instances and sites, if counted, are given as an array of JSKOS Concept Occurrences in field occurrences, each identified by subfield relation:

{
  "uri": "http://www.wikidata.org/entity/Q30014",
  "notation": [ "Q30014" ],
  "prefLabel": {
    "en": "outer planet of the Solar system"
  },
  "occurrences": [
    {
      "relation": "http://www.wikidata.org/entity/P31",
      "count": 4
    },
    {
      "relation": "http://schema.org/about",
      "count": 25
    }
  ]
}
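
Reading a count therefore means looking up the occurrence with the matching relation. A small sketch, assuming the two relation URIs shown in the example above; the helper name occurrenceCount is hypothetical.

```javascript
// Relation URIs as they appear in the example above
const INSTANCES = 'http://www.wikidata.org/entity/P31'
const SITES = 'http://schema.org/about'

// Hypothetical helper: returns the count for a relation, or undefined
// if that number was not counted (e.g. with option --brief)
function occurrenceCount (concept, relation) {
  const occurrences = concept.occurrences || []
  const hit = occurrences.find(o => o.relation === relation)
  return hit ? hit.count : undefined
}

const outerPlanet = {
  uri: 'http://www.wikidata.org/entity/Q30014',
  notation: ['Q30014'],
  occurrences: [
    { relation: 'http://www.wikidata.org/entity/P31', count: 4 },
    { relation: 'http://schema.org/about', count: 25 }
  ]
}

console.log(occurrenceCount(outerPlanet, INSTANCES)) // 4
console.log(occurrenceCount(outerPlanet, SITES)) // 25
```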

Mappings (option --mappings) are stored in field mappings as an array of JSKOS Concept Mappings:

[
  {
    "from": {
      "memberSet": [
        { "uri": "http://www.wikidata.org/entity/Q634" }
      ]
    },
    "to": {
      "memberSet": [
        { "uri": "http://dbpedia.org/ontology/Planet" }
      ]
    },
    "type": [
      "http://www.w3.org/2004/02/skos/core#exactMatch",
      "http://www.w3.org/2002/07/owl#equivalentClass",
      "http://www.wikidata.org/entity/P1709"
    ]
  }
]

The mapping type is given in field type with the Wikidata property URI as last array element and the SKOS mapping relation URI as first.
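
Based on that ordering, a mapping can be flattened into a simpler record. This is a sketch under the assumption that every mapping has from.memberSet, to.memberSet and a non-empty type array as in the example; the helper name summarizeMappings is made up.

```javascript
// Hypothetical helper: flattens JSKOS mappings into simple records
function summarizeMappings (mappings) {
  return (mappings || []).map(mapping => {
    const types = mapping.type || []
    return {
      from: mapping.from.memberSet.map(c => c.uri),
      to: mapping.to.memberSet.map(c => c.uri),
      relation: types[0], // SKOS mapping relation URI comes first
      property: types[types.length - 1] // Wikidata property URI comes last
    }
  })
}

// The mapping from the example above
const planetMappings = [
  {
    from: { memberSet: [{ uri: 'http://www.wikidata.org/entity/Q634' }] },
    to: { memberSet: [{ uri: 'http://dbpedia.org/ontology/Planet' }] },
    type: [
      'http://www.w3.org/2004/02/skos/core#exactMatch',
      'http://www.w3.org/2002/07/owl#equivalentClass',
      'http://www.wikidata.org/entity/P1709'
    ]
  }
]

console.log(summarizeMappings(planetMappings))
```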

NDJSON format

Option --format ndjson serializes JSON field concepts with one record per line. The order of records is the same as in txt, json, and csv format, but each concept is included only once.
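
Such output can be read back line by line. A minimal sketch, assuming the output has already been loaded into a string (for instance from a file written with --output); the function name parseNdjson and the abbreviated sample records are illustrative only.

```javascript
// Hypothetical helper: parses newline-delimited JSON into an array of records
function parseNdjson (text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '') // ignore empty lines such as a trailing newline
    .map(line => JSON.parse(line))
}

// Two abbreviated concept records, one per line
const ndjsonSample = [
  '{"uri":"http://www.wikidata.org/entity/Q30014","notation":["Q30014"]}',
  '{"uri":"http://www.wikidata.org/entity/Q3504248","notation":["Q3504248"]}'
].join('\n') + '\n'

const parsedConcepts = parseNdjson(ndjsonSample)
console.log(parsedConcepts.map(c => c.notation[0])) // [ 'Q30014', 'Q3504248' ]
```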

CSV and TSV format

CSV and TSV format are optimized for comparing differences over time. Each output row consists of six fields:

  • level in the hierarchy, indicated by zero or more "-" (default) or "=" characters (multihierarchy).

  • id of the item. Items on the same level are sorted by their id.

  • label of the item. The language can be selected with option --lang. The label in csv format is quoted.

  • sites: number of connected sites (Wikipedia and related project editions). Larger numbers may indicate more established concepts.

  • instances: number of direct instances (property P31).

  • parents outside of the hierarchy, indicated by zero or more "^" characters.

For instance, the CSV output for Q634 looks like this:

$ wdtaxonomy -f csv Q634
level,id,label,sites,instances,parents
,Q634,"planet",196,7,^
-,Q44559,"extrasolar planet",81,833,^
--,Q205901,"circumbinary planet",14,10,
--,Q327757,"super-Earth",32,46,
...
-,Q128207,"terrestrial planet",67,7,
==,Q327757,"super-Earth",32,46,
...

In this example, there are 196 Wikipedia editions or other sites with an article about planets, and seven Wikidata items are direct instances of planet. At the end of the line, "^" indicates that "planet" has one superclass. In the next row, "extrasolar planet" (Q44559) is a subclass of planet with another superclass, indicated by "^". Both "circumbinary planet" and "super-Earth" are subclasses of "extrasolar planet". The latter also occurs as a subclass of "terrestrial planet", where it is marked by "==" instead of "--".
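
A row in this format can be decoded as sketched below. The sketch assumes the six columns of the header shown above and that quotes inside labels are escaped by doubling, as is conventional for CSV; both function names are hypothetical.

```javascript
// Splits one CSV line into fields, honoring double-quoted fields
function splitCsvLine (line) {
  const fields = []
  let field = ''
  let quoted = false
  for (let i = 0; i < line.length; i++) {
    const ch = line[i]
    if (quoted && ch === '"' && line[i + 1] === '"') {
      field += '"' // doubled quote inside a quoted field
      i++
    } else if (ch === '"') {
      quoted = !quoted // opening or closing quote
    } else if (ch === ',' && !quoted) {
      fields.push(field)
      field = ''
    } else {
      field += ch
    }
  }
  fields.push(field)
  return fields
}

// Hypothetical helper: turns one data row into an object
function parseRow (line) {
  const [level, id, label, sites, instances, parents] = splitCsvLine(line)
  return {
    depth: level.length, // number of "-" or "=" characters
    repeated: level.startsWith('='), // "=" marks a repeated occurrence (multihierarchy)
    id,
    label,
    sites: Number(sites),
    instances: Number(instances),
    parents: parents.length // number of "^" characters
  }
}

console.log(parseRow(',Q634,"planet",196,7,^'))
console.log(parseRow('==,Q327757,"super-Earth",32,46,'))
```

Rows produced with options such as --brief may lack some of these columns, so the field positions would need to be taken from the header line in that case.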

Usage as module

Add wikidata-taxonomy as a dependency to your package.json:

$ npm install wikidata-taxonomy --save

The library provides:

  • queryTaxonomy(id, options) returns a promise with a taxonomy extracted from Wikidata as JSKOS Concept Scheme. See JSON format of the command line client for documentation.

    const { queryTaxonomy } = require('wikidata-taxonomy')

    // query labels in French and skip counting instances and sites
    const options = { lang: 'fr', brief: true }
    queryTaxonomy('Q634', options)
      .then(taxonomy => {
        // print identifier and French label of every extracted concept
        taxonomy.concepts.forEach(concept => {
          const qid = concept.notation[0]
          const label = (concept.prefLabel || {}).fr || '???'
          console.log('%s %s', qid, label)
        })
      })
      .catch(error => console.error('E', error))

    Options are roughly equivalent to the command line query options:

    • boolean flags brief, children, description, labels, total, instances, instancecount, sitecount, reverse, post
    • SPARQL endpoint configuration with endpoint, user, password
    • language tag language or lang
    • array property (set to ['P279', 'P31'] by default)
    • array or string mappings
  • serializeTaxonomy contains serializers to be called with a taxonomy, an output stream, and optional configuration:

    const { serializeTaxonomy } = require('wikidata-taxonomy')
    
    // serialize taxonomy to stream
    serializeTaxonomy.csv(taxonomy, process.stdout)
    serializeTaxonomy.txt(taxonomy, process.stdout, {colors: true}) // FIXME
    serializeTaxonomy.json(taxonomy, process.stdout)
    serializeTaxonomy.ndjson(taxonomy, process.stdout)

Usage in web applications

Experimental support for using this library in web applications is provided by file wikidata-taxonomy.js in directory dist. The gh-pages branch contains a sample application, also available at http://jakobvoss.de/wikidata-taxonomy/.

Requires wikidata-sdk and an HTTP client library. The latter can be attached to window.requestPromise (before wikidata-taxonomy is loaded). Axios is detected by default.

<html>
  <head>
    <script src="https://unpkg.com/wikidata-sdk/dist/wikidata-sdk.min.js"></script>
    <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
    <script src="https://unpkg.com/wikidata-taxonomy/dist/wikidata-taxonomy.min.js"></script>
  </head>
  <body>
    ...
  </body>
</html>

Release notes

Release notes are listed in file CHANGES.md in the source code repository.
