
pignlproc

This project is archived.

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Project status

This project is alpha / experimental code. Features are implemented when needed.

Some preliminary results are available in this blog post:

Building from source

Install Maven (tested with version 2.2.1) and a Java 6 JDK, then:

$ mvn assembly:assembly

This should download the dependencies, build a jar in the target/ subfolder and run the tests.
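
The jar path is then passed to the example scripts below through the PIGNLPROC_JAR parameter. A Pig script of your own would register it the same way; a minimal sketch:

-- make the pignlproc UDFs available to the script; $PIGNLPROC_JAR is
-- filled in by the -p PIGNLPROC_JAR=... command line parameter
REGISTER $PIGNLPROC_JAR;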

Usage

The following sections introduce sample scripts that demonstrate the User Defined Functions provided by pignlproc on practical Wikipedia mining tasks.

These examples demonstrate how to run Pig on your local machine on sample files. In production (with complete dumps) you will want to start up a real Hadoop cluster, upload the dumps to HDFS, adjust the paths in the examples below to match your setup, and remove the '-x local' command line parameter so that Pig uses your Hadoop cluster.

The pignlproc wiki provides comprehensive documentation on where to download the dumps and how to set up a Hadoop cluster on EC2 using Apache Whirr.
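
For instance, assuming a running cluster and a downloaded dump, the production workflow sketched above could look like the following (the dump name and HDFS paths are illustrative):

$ hadoop fs -mkdir dumps
$ hadoop fs -put enwiki-latest-pages-articles.xml dumps/
$ pig \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p LANG=en \
  -p INPUT=dumps/enwiki-latest-pages-articles.xml \
  -p OUTPUT=output \
  examples/extract_links.pig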

Extracting links from a raw Wikipedia XML dump

You can start from the extract_links.pig example script:

$ pig -x local \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p LANG=fr \
  -p INPUT=src/test/resources/frwiki-20101103-pages-articles-sample.xml \
  -p OUTPUT=/tmp/output \
  examples/extract_links.pig
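
The core of such a script boils down to loading the dump with a parsing loader and flattening the per-article link bags. The following sketch is illustrative (the loader class and its schema are quoted from memory; see the script itself for the authoritative version):

-- register the pignlproc UDFs
REGISTER $PIGNLPROC_JAR;

-- parse the XML dump into one tuple per wiki article
parsed = LOAD '$INPUT'
  USING pignlproc.storage.ParsingWikipediaLoader('$LANG')
  AS (title, id, pageUrl, text, redirect, links, headers, paragraphs);

-- emit one row per (article, outgoing link) pair
flattened = FOREACH parsed GENERATE title, FLATTEN(links);

STORE flattened INTO '$OUTPUT/links';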

Building a NER training / evaluation corpus from Wikipedia and DBpedia

The goal of these sample scripts is to extract a pre-formatted corpus suitable for training sequence labeling algorithms such as MaxEnt or CRF models with OpenNLP, Mallet, or crfsuite.

To achieve this you can run the following scripts (split into somewhat independent parts that store intermediate results, so as to avoid recomputing everything from scratch when you change the source files or some parameters).

The first script parses a Wikipedia dump and extracts the sentences that carry outgoing links, along with some ordering and positioning information:

$ pig -x local \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p LANG=en \
  -p INPUT=src/test/resources/enwiki-20090902-pages-articles-sample.xml \
  -p OUTPUT=workspace \
  examples/ner-corpus/01_extract_sentences_with_links.pig
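
Continuing from a load such as the one sketched earlier, the sentence extraction step can be pictured as follows (the UDF name and signature are illustrative):

-- split each article into sentences and keep those carrying at least
-- one outgoing link, together with the link positions
sentences = FOREACH parsed GENERATE
  pageUrl,
  FLATTEN(pignlproc.evaluation.SentencesWithLink(text, links, paragraphs));

STORE sentences INTO '$OUTPUT/sentences_with_links';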

The parser has been measured to process about 1 MB/s in local mode on a 2009 MacBook Pro.

The second script parses DBpedia dumps, assumed to be in the folder /home/ogrisel/data/dbpedia:

$ pig -x local \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p LANG=en \
  -p INPUT=/home/ogrisel/data/dbpedia \
  -p OUTPUT=workspace \
  examples/ner-corpus/02_dbpedia_article_types.pig

This step should complete in a couple of minutes in local mode.

This script could be adapted or replaced to use other typed-entity knowledge bases linked to Wikipedia that provide downloadable dumps in NT or TSV format, for instance Freebase or Uberblic.
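
For NT dumps, the gist of the extraction can be reproduced with plain Pig built-ins. A rough sketch with an illustrative dump file name (the real script uses pignlproc helpers and handles the NT syntax more carefully):

-- NT triples are space-separated: <subject> <predicate> <object> .
triples = LOAD '$INPUT/instance_types_en.nt'
  USING PigStorage(' ')
  AS (subject:chararray, predicate:chararray, object:chararray);

-- keep only rdf:type statements and project (article, type) pairs
typed = FILTER triples BY
  predicate == '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>';
article_types = FOREACH typed GENERATE subject, object;

STORE article_types INTO '$OUTPUT/article_types';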

The third script merges the partial results of the first two scripts and reorders them, grouping the sentences of the same article together, so as to build annotated sentences suitable for OpenNLP, for instance:

$ pig -x local \
  -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p INPUT=workspace \
  -p OUTPUT=workspace \
  -p LANG=en \
  -p TYPE_URI=http://dbpedia.org/ontology/Person \
  -p TYPE_NAME=person \
  examples/ner-corpus/03bis_filter_join_by_type_and_convert.pig

$ head -3 workspace/opennlp_person/part-r-00000
The Table Talk of <START:person> Martin Luther <END> contains the story of a 12-year-old boy who may have been severely autistic .
The New Latin word autismus ( English translation autism ) was coined by the Swiss psychiatrist <START:person> Eugen Bleuler <END> in 1910 as he was defining symptoms of schizophrenia .
Noted autistic <START:person> Temple Grandin <END> described her inability to understand the social communication of neurotypicals , or people with normal neural development , as leaving her feeling "like an anthropologist on Mars " .
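
A file in this format can then be fed to the OpenNLP name finder trainer, for instance (the exact command line flags vary between OpenNLP versions):

$ opennlp TokenNameFinderTrainer -lang en -encoding UTF-8 \
    -data workspace/opennlp_person/part-r-00000 \
    -model en-ner-person.bin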

Building a document classification corpus

TODO: explain how to extract bag-of-words or n-gram and document frequency features suitable for document classification, for instance using an SGD model from Mahout.

License

Copyright 2010 Nuxeo and contributors:

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

More Repositories

1. parallel_ml_tutorial (Jupyter Notebook, 1,589 stars) - Tutorial on scikit-learn and IPython for parallel machine learning
2. notebooks (Jupyter Notebook, 556 stars) - Some sample IPython notebooks for scikit-learn
3. pygbm (Python, 178 stars) - Experimental Gradient Boosting Machines in Python with numba
4. python-appveyor-demo (PowerShell, 153 stars) - Demo project for building Python wheels with appveyor.com
5. docker-distributed (Shell, 92 stars) - Experimental docker-compose setup to bootstrap distributed on a docker-swarm cluster
6. spylearn (Python, 79 stars) - Repo for experiments on pyspark and sklearn
7. paper2ebook (Java, 53 stars) - Utility to restructure research papers published as US Letter or A4 PDF files, typically to remove the two-column layout
8. text-mining-class (Python, 43 stars) - Introduction to web scraping and text mining
9. dbpediakit (Python, 38 stars) - Python utilities to work with the DBpedia dumps for analytics
10. euroscipy-2022-time-series (Jupyter Notebook, 32 stars) - Tutorial on time-series forecasting with scikit-learn
11. wheelhouse-uploader (Python, 31 stars) - Script to help maintain a wheelhouse folder on cloud storage
12. my-linux-devbox (Scheme, 26 stars) - Vagrant / Salt configuration with Ubuntu to work on projects related to the scipy stack under Python 3 and Python 2
13. oglearn (Python, 24 stars) - ogrisel's utility extensions for scikit-learn
14. eegssl (Python, 16 stars) - Experiments on self-supervised learning on EEG data
15. mahout (Java, 15 stars) - Personal development repository to prepare contributions and patches for Apache Mahout
16. euroscipy_2017_sklearn (Jupyter Notebook, 15 stars) - Notebooks for the EuroScipy 2017 tutorial (based on Adult Census income data)
17. corpusmaker (Clojure, 14 stars) - Clojure utilities to build training corpora for machine learning / NLP out of public wikimedia dumps; partially stalled, will probably be reworked as cascalog scripts, and is likely to be replaced by pignlproc due to licensing constraints for future integration in Apache projects
18. python-winbuilder (Python, 9 stars) - Tools to script a build environment on Windows for Python projects
19. codemaker (Python, 9 stars) - Neural-net-based utility to build low-dimensional and/or sparse codes
20. pycon-pydata-sprint (Python, 8 stars) - Experimental work for using IPython.parallel with scikit-learn
21. salt-ipcluster (Scheme, 7 stars) - Salt states and modules to set up an IPython cluster
22. docker-openblas (Shell, 5 stars) - Docker container with an automated build for the OpenBLAS stable branch
23. stanbol-isbn (Java, 5 stars) - Demo Stanbol extension for detecting and linking ISBNs in text documents
24. silva (4 stars) - Leaf recognition prototype
25. bbuzz-semantic-hackathon (Java, 3 stars) - Sandbox for the Berlin Buzzwords semantic hackathon
26. research (Jupyter Notebook, 3 stars) - Draft research notes, code and todos
27. scikit-learn-github-actions (Python, 2 stars) - Test repo for GitHub Actions workflows
28. ipython-azure (Python, 2 stars) - Utilities to deploy an IPython parallel cluster on Windows Azure
29. lsh_glove (Python, 2 stars) - Script to build various LSH / ANN indices on GloVe word embeddings
30. cardice (Python, 2 stars) - Cloud compute cluster setup with SaltStack
31. energy_charts (Jupyter Notebook, 2 stars)
32. decks (CSS, 2 stars) - Slide decks for conferences
33. brain2vec (Python, 2 stars) - Brain embedding by contextual predictions (draft)
34. instrumentalist (Python, 2 stars) - Python scripts to read XBee sensor data and push it to a CouchDB database
35. mnist-sbi (Python, 2 stars) - Simulation-based inference for the important problem of drawing digits
36. scikit-learn.org (Python, 1 star) - Source repository to build the HTML website for the scikit-learn project
37. camera-html5 (1 star) - Test repo for HTML5 camera access on mobile phones
38. sandbox (1 star)
39. cpython-nightly (1 star) - Automated build of the master branch of CPython for Continuous Integration purposes
40. docker-sklearn-openblas (Shell, 1 star)
41. us-housing-prices-v2-parquet (Jupyter Notebook, 1 star) - Exploratory Data Analysis on a parquet dump of https://www.dolthub.com/repositories/dolthub/us-housing-prices-v2 using duckdb and Ibis