Repository Details

The software used to extract structured data from Wikipedia

DBpedia Information Extraction Framework

Homepage: http://dbpedia.org
Documentation: http://dev.dbpedia.org/Extraction
Get in touch with DBpedia: https://wiki.dbpedia.org/join/get-in-touch
Slack: join the #dev-team channel in the DBpedia Slack workspace, the main place for development updates and discussions

About DBpedia

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in new and interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
To check out the projects of DBpedia, visit the official DBpedia website.

Getting Started

The Easy Way - Execution using the MARVIN release bot

Running the extraction framework is a relatively complex task, which is documented in detail in the advanced QuickStart guide. To run the extraction the same way the DBpedia core team does, use the MARVIN release bot. The MARVIN bot automates the overall extraction process, from downloading the ontology, mappings and Wikipedia dumps to extracting and post-processing the data.

git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config
./setup-or-reset-dief.sh
# test run Romanian extraction, very small
./marvin_extraction_run.sh test
# around 4-7 days
./marvin_extraction_run.sh generic

Standalone Execution

If you plan to work on improving the codebase of the framework, you need to run the extraction framework on its own, as described in the QuickStart guide. This is highly recommended, since you will learn a lot about the extraction framework in the process.

  • Extractors are the core of the extraction framework. So far, many extractors have been developed to extract particular information from different Wikimedia projects. To learn more, check the New Extractors guide, which explains the process of writing a new extractor.

  • Check the Debugging Guide and learn how to debug the extraction framework.

Execution using Apache Spark

To speed up the extraction process, the extraction framework has been adapted to run on Apache Spark. Currently, more than half of the extractors can be executed using Spark. Extraction with Spark follows a slightly different process and requires a different execution setup. Check the QuickStart guide on how to run the extraction using Apache Spark.

Note: if possible, new extractors should be implemented using Apache Spark. To learn more, check the New Extractors guide, which explains the process of writing a new extractor.

The DBpedia Extraction Framework

The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written using Scala 2.8. The framework is available from the DBpedia GitHub repository under the GNU GPL. The change log may reveal more recent developments. More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki

The DBpedia extraction framework is structured into different modules:

  • Core Module : Contains the core components of the framework.
  • Dump extraction Module : Contains the DBpedia dump extraction application.

Core Module

http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png

Components

  • Source : The Source package provides an abstraction over a source of MediaWiki pages.
  • WikiParser : The Wiki Parser package specifies a parser, which transforms a MediaWiki page source into an Abstract Syntax Tree (AST).
  • Extractor : An Extractor is a mapping from a page node to a graph of statements about it.
  • Destination : The Destination package provides an abstraction over a destination of RDF statements.
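
The way these four components fit together can be illustrated with a small, self-contained sketch. It is written in Python for brevity and is not the framework's Scala API: every class and function name here (WikiPage, PageNode, Statement, ListSource, parse, extract_links, ListDestination, run) is hypothetical, and the "parser" only recognises internal links rather than building a full AST.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# NOTE: all names in this sketch are hypothetical and chosen for illustration;
# they do not mirror the real classes of the extraction framework.


@dataclass(frozen=True)
class WikiPage:
    title: str
    source: str  # raw wiki markup


@dataclass(frozen=True)
class PageNode:
    title: str
    links: Tuple[str, ...]  # targets of [[internal links]] found in the markup


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str


class ListSource:
    """Source: hides where pages come from (here, an in-memory list)."""

    def __init__(self, pages: Iterable[WikiPage]) -> None:
        self._pages = list(pages)

    def __iter__(self) -> Iterator[WikiPage]:
        return iter(self._pages)


# Captures the link target and stops at "|" (label), "#" (anchor) or "]".
LINK = re.compile(r"\[\[([^\]|#]+)")


def parse(page: WikiPage) -> PageNode:
    """WikiParser: turn page markup into a (deliberately tiny) page node."""
    targets = tuple(t.strip() for t in LINK.findall(page.source))
    return PageNode(page.title, targets)


def extract_links(node: PageNode) -> List[Statement]:
    """Extractor: map a page node to statements about that page."""
    return [Statement(node.title, "linksTo", target) for target in node.links]


class ListDestination:
    """Destination: hides where statements go (here, an in-memory list)."""

    def __init__(self) -> None:
        self.statements: List[Statement] = []

    def write(self, statements: Iterable[Statement]) -> None:
        self.statements.extend(statements)


Extractor = Callable[[PageNode], List[Statement]]


def run(source: Iterable[WikiPage], extractors: List[Extractor],
        destination: ListDestination) -> None:
    """Wire the components together: Source -> WikiParser -> Extractor -> Destination."""
    for page in source:
        node = parse(page)
        for extractor in extractors:
            destination.write(extractor(node))


source = ListSource([
    WikiPage("Berlin", "Berlin is the capital of [[Germany]] "
                       "and lies on the [[Spree|river Spree]]."),
])
destination = ListDestination()
run(source, [extract_links], destination)
for statement in destination.statements:
    print(statement)
```

Because the extractor only sees a page node and only returns statements, a new extractor can be added to the list passed to run without touching the source or the destination.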

In addition to the core components, a number of utility packages offer essential functionality to the extraction code.

Dump extraction Module

More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions.

To know more about the extraction framework, click here

Contribution Guidelines

If you want to work on one of the issues, assign yourself to it or at least leave a comment that you are working on it and how.
If you have an idea for a new feature, make an issue first, assign yourself to it, then start working.
Please make sure you have read the Developer's Certificate of Origin, further down on this page!

  1. Fork the main extraction-framework repository on GitHub.
  2. Clone this fork onto your machine (git clone <your_repo_url_on_github>).
  3. Switch to the dev branch (git checkout dev).
  4. Create a new development branch from the latest revision of the dev branch. Give the branch a meaningful name, for example fixRestApiParams (git checkout -b fixRestApiParams dev).
  5. Make changes and commit them to this branch.
      • Please commit regularly in small batches of things "that go together" (for example, changing a constructor and all the instance creating calls). Putting a huge batch of changes in one commit is bad for code reviews.
      • In the commit messages, summarize the commit in the first line using not more than 70 characters. Leave one line blank and describe the details in the following lines, preferably in bullet points, like in 7776e31....
  6. When you are done with a bugfix or feature, rebase your branch onto extraction-framework/dev (git pull --rebase https://github.com/dbpedia/extraction-framework.git dev). Resolve any conflicts and commit.
  7. Push your branch to GitHub (git push origin fixRestApiParams).
  8. Send a pull request from your branch into extraction-framework/dev via GitHub.
      • In the description, reference the associated issue (for example, "Fixes #123 by ..." for issue number 123).
      • Your changes will be reviewed and discussed on GitHub.
      • In addition, Travis-CI will test if the merged version passes the build.
      • If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to step 5. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
      • When everything is fine, your changes will be merged into extraction-framework/dev. Eventually, the dev branch, including your improvements, will be merged into the master branch.
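
The commit message convention described above (a summary line of at most 70 characters, followed by a blank line before any details) is easy to check mechanically. The following is a minimal sketch of such a check, not a tool shipped with the framework; the function name check_commit_message and the wording of the problem messages are assumptions made for this example.

```python
from typing import List

MAX_SUMMARY_LENGTH = 70  # limit stated in the contribution guidelines


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems; an empty list means the message fits the convention."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["the first line must contain a summary"]

    problems: List[str] = []
    if len(lines[0]) > MAX_SUMMARY_LENGTH:
        problems.append(
            f"summary has {len(lines[0])} characters, limit is {MAX_SUMMARY_LENGTH}"
        )
    # Details are optional, but if present they must be separated by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("the second line must be blank")
    return problems


good = (
    "Fix REST API parameter parsing\n"
    "\n"
    "- reject empty language codes\n"
    "- add a regression test"
)
bad = "Fix REST API parameter parsing\n- reject empty language codes"

print(check_commit_message(good))  # []
print(check_commit_message(bad))   # ['the second line must be blank']
```

A check like this could be wired into a local commit-msg hook, although the guidelines themselves do not require one.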

Important: Developer's Certificate of Origin

By sending a pull request to the extraction-framework repository on GitHub, you implicitly accept the Developer's Certificate of Origin 1.1.

License

The source code is under the terms of the GNU General Public License, version 2.

More Repositories

  1. fact-extractor (Python, 528 stars): Fact Extraction from Wikipedia Text
  2. lookup (Scala, 185 stars): Outputs a list of ranked DBpedia resources for a search string.
  3. virtuoso-sparql-endpoint-quickstart (Shell, 118 stars): creates a docker image with Virtuoso preloaded with the latest DBpedia dataset
  4. chatbot (Java, 103 stars): DBpedia Chatbot
  5. dbpedia (PHP, 97 stars): Various tools for the DBpedia project - This does NOT contain the DBpedia extraction framework
  6. embeddings (Python, 86 stars): Knowledge Base Embeddings for DBpedia
  7. links (Java, 50 stars): A repo that contains outgoing links from DBpedia
  8. dbpedia-lookup (Java, 41 stars): A generic entity retrieval service for linked data. Contains presets to replicate the DBpedia Lookup service.
  9. distributed-extraction-framework (Web Ontology Language, 41 stars): DBpedia Distributed Extraction Framework: Extract structured data from Wikipedia in a parallel, distributed manner
  10. ontology-driven-api (Java, 40 stars): An ontology-driven RESTstyle API for DBpedia backed by an external SPARQL endpoint
  11. databus (JavaScript, 39 stars): A digital factory platform for managing files online with stable IDs, high-quality metadata, powerful API and tools for building on data: find, access, make interoperable, re-use
  12. GSoC (37 stars): Google Summer of Code organization
  13. ontology-tracker (Java, 35 stars): Here we keep track of modification requests in the DBpedia Ontology
  14. DataId-Ontology (HTML, 35 stars): The DBpedia DataID vocabulary is a metadata system for detailed descriptions of datasets and their physical instances, as well as their relation to agents like persons or organizations in regard to their rights and responsibilities.
  15. table-extractor (Python, 32 stars): Extract Data from Wikipedia Tables
  16. list-extractor (Python, 30 stars): Extract Data from Wikipedia Lists
  17. dbpedia-live-mirror (Java, 26 stars): Keeps a mirror of DBpedia live in sync
  18. dbpedia-docs (Shell, 23 stars): A tutorial about DBpedia and Linked Data in general
  19. neural-rdf-verbalizer (Python, 21 stars): 🗣 Multilingual RDF Verbalizer – Google Summer of Code 2019
  20. mappings-autogeneration (Python, 21 stars): Tools & scripts to infer new Wikipedia infobox to ontology mappings
  21. gsoc-2020-dashboard (Python, 20 stars)
  22. archivo (Python, 20 stars): DBpedia Archivo - Augmented Ontology Archive powered by Databus
  23. neural-extraction-framework (Jupyter Notebook, 16 stars): Repository for the GSoC project 'Towards a Neural Extraction Framework'
  24. webid (PHP, 15 stars): WebID Creation and Validation (Tutorial, Tools, Best practices)
  25. databus-client (Scala, 14 stars)
  26. dbpedia-links (13 stars): moved to https://github.com/dbpedia/links
  27. topicmodel-extractor (Java, 13 stars): A repository for the "Combining DBpedia and Topic Modeling" GSoC 2016 idea
  28. sci-graph-links (Shell, 13 stars): Linking DBpedia to SciGraph
  29. RDF2text-GAN (Jupyter Notebook, 13 stars): RDF -to- text generator, using GANs and reinforcement learning. For Google summer of code 2020.
  30. dataid (JavaScript, 13 stars): The DBpedia Data ID Unit is a DBpedia Group with the goal of describing LOD datasets via RDF files, to host and deliver these metadata files together with the dataset in a uniform way, create and validate such files and deploy the results for the DBpedia and its local chapters.
  31. dbpedia-wiktionary (11 stars): Precompiled executables, config files and working examples for http://dbpedia.org/Wiktionary
  32. dev.dbpedia.org (CSS, 10 stars): Developer Documentation at http://dev.dbpedia.org
  33. mappings-ui (JavaScript, 10 stars): DBpedia RML mappings management frontend
  34. gfs (Jupyter Notebook, 10 stars): DBpedia, which frequently crawls and analyses over 120 Wikipedia language editions has near complete information about (1) which facts are in infoboxes across all Wikipedias (2) where Wikidata is already used in those infoboxes. GlobalFactSyncRE will extract all infobox facts and their references to produce a tool for Wikipedia editors that detects and displays differences across infobox facts in an intelligent way to help sync infoboxes between languages and/or Wikidata. The extracted references will also be used to enhance Wikidata. Click Join below to receive GFS updates via {{ping}} to your Wikiaccount.
  35. mappings-tracker (9 stars): This project is used for tracking mapping issues in mappings.dbpedia.org
  36. linking (Python, 9 stars): Workflow for linking external datasets to DBpedia.
  37. dbpedia-vad-i18n (JavaScript, 9 stars): Virtuoso plugin for the serving Linked Data
  38. event-extractor (Java, 8 stars): Repository for the DBpedia GSoC Hybrid Classifier/Rule-based Event Extractor Project
  39. wikidata-mapper (7 stars): Automated Wikidata mappings to DBpedia ontology GSoC 2014 Project
  40. keyword-search (Java, 7 stars): keyword-search
  41. jsonpedia-extractor (JavaScript, 6 stars): Fine grained massive extraction of Wikipedia content GSoC 2014 Project
  42. databus-maven-plugin (Scala, 6 stars): Databus Maven Plugin: Aligning Data and Software Lifecycle with Maven
  43. cmem-plugin-databus (Python, 5 stars): eccenca Corporate Memory build plugin to publish and load datasets from a DBpedia databus service.
  44. fusion (Java, 5 stars): algorithms to fuse dbpedia
  45. predicate-finder (Python, 5 stars)
  46. mappings_chrome_extension (JavaScript, 5 stars): A chrome extension for generating new mappings
  47. databus-mods (Scala, 5 stars): Databus Mods (How To and Mod Ontology and Reference Implementation)
  48. mapping-tool (JavaScript, 4 stars): A GUI for mapping Wikipedia Infoboxes to the DBpedia ontology
  49. dbpedia-chatbot-backend (HTML, 4 stars)
  50. marvin-config (HTML, 4 stars): Public configuration files for MARVIN - the DBpedia Knowledge Graph extraction and release bot - running on the TIB servers
  51. stack-tutorial-resources (XSLT, 4 stars): Resource for the DBpedia Stack Tutorial
  52. nlp-dbpedia (4 stars): Free, open and interoperable (FOI) NLP benchmarks used for and by DBpedia
  53. Multilingual-RDF-Verbalizer (PLSQL, 4 stars)
  54. gstore (Scala, 4 stars): Git repo / triple store hybrid graph storage
  55. chatbot-ng (JavaScript, 4 stars): Repository for the GSoC 2021 project 'Modular DBpedia Chatbot'.
  56. dbpedia-webprotege (Java, 4 stars): A webprotege deployment for editing the DBpedia ontology
  57. ontology-time-machine (Python, 4 stars)
  58. social-knowledge-graph (Jupyter Notebook, 4 stars): Repository for the GSoC 2021 project 'Social Knowledge Graph'.
  59. quad-processor-util (Scala, 3 stars): A handy library for reading & mapping multiple N-Triple (or N-Quad) files at once.
  60. dnkg-pilot (Shell, 3 stars): Dutch National Knowledge Graph Pilot
  61. DBTax (Java, 3 stars): DBTax project
  62. DBpediaAI (3 stars): The DBpedia AI project
  63. dbpedia-wiktionary-configuration (3 stars): The configuration of DBpedia Wiktionary
  64. MissingBot (Java, 3 stars)
  65. DBpedia-LiveNeural-Chatbot (Python, 3 stars): The DBpedia Live Neural Chatbot
  66. tablist-extractor (Python, 3 stars): Fusion of the table and list extractors
  67. dbpedia-databus-collection-downloader (Java, 2 stars)
  68. gsoc-dbpedia-dashboard (JavaScript, 2 stars)
  69. dbpedia-widgets (Python, 2 stars): Simple embed-able widgets
  70. WorldFacts (2 stars)
  71. tutorials (Shell, 2 stars)
  72. healthcare-platform (Jupyter Notebook, 2 stars): Repository for the GSoC 2021 project 'Update DBpedia SPARQL for Wiki Resources Related to Pandemic, Healthcare, and Health AI Fields'.
  73. format-mappings (2 stars): Dev repo towards a knowledge library for format mappings
  74. dbpedia-chatbot-data (Jupyter Notebook, 2 stars)
  75. databus-python-client (Python, 2 stars)
  76. media-extractor (Clojure, 1 star): DBpedia support for multimedia data sources other than Wikipedia. GSoC 2014 project.
  77. dbpedia-live-update-viewer (JavaScript, 1 star)
  78. DBpedia-Spotlight-Dashboard (Python, 1 star): An integrated statistical information tool from the Wikipedia dumps and the DBpedia Extraction Framework artifacts
  79. link-based-complementary-fusion (Java, 1 star): A light-weight tool to fuse complementary facts to DBpedia identifiers
  80. databus-transfer (JavaScript, 1 star): Transfer published data to a new Databus
  81. databus-moss (Java, 1 star): Databus Metadata Overlay Search System
  82. events (Java, 1 star): DBpedia Events
  83. databus-shared-lib (Scala, 1 star)
  84. databus-moss-frontend (Svelte, 1 star): Databus MOSS Frontend
  85. databus-moss-docker (1 star): Docker Setup for MOSS
  86. wall-of-fame (Scala, 1 star): A SHACLOntology and several tools to attribute contributions to the DBpedia movement to individual DBpedians, i.e. give credit for their merit in a machine readable format (RDF).
  87. community-viewer (1 star): a simple interface that displays the DBpedia community