• Stars
    star
    177
  • Rank 215,985 (Top 5 %)
  • Language
    Python
  • License
    GNU General Publi...
  • Created about 9 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Implementation with some extensions of the paper "Snowball: Extracting Relations from Large Plain-Text Collections" (Agichtein and Gravano, 2000)

python   example event parameter   code coverage   Checked with mypy   Code style: black   License: GPL v3   Pull Requests Welcome

Snowball: Extracting Relations from Large Plain-Text Collections

An implementation of Snowball, a relationship extraction system that uses a bootstrapping/semi-supervised approach, it relies on an initial set of seeds, i.e. paris of named-entities representing relationship type to be extracted.

Extracting companies headquarters:

The input text needs to have the named-entities tagged, like show in the example bellow:

The tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.
<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.
<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.
<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.
<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.
<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.
<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.

We need to give seeds to bootstrap the extraction process, specifying the type of each named-entity and relationships examples that should also be present in the input text:

e1:ORG
e2:LOC

Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
DoubleClick;New York
SAP;Walldorf

To run a simple example, download the following files

- sentences_short.txt.bz2
- seeds_positive.txt

Install Snowball using pip

pip install snowball-extractor

Run the following command:

snowball --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6

After the process is terminated an output file relationships.jsonl is generated containing the extracted relationships.

You can pretty print it's content to the terminal with: jq '.' < relationships.jsonl:

{
  "entity_1": "Medtronic",
  "entity_2": "Minneapolis",
  "confidence": 0.9982486865148862,
  "sentence": "<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . ",
  "bef_words": "",
  "bet_words": ", based in",
  "aft_words": ", is",
  "passive_voice": false
}

{
  "entity_1": "DynCorp",
  "entity_2": "Reston",
  "confidence": 0.9982486865148862,
  "sentence": "Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .",
  "bef_words": "Because",
  "bet_words": ", headquartered in",
  "aft_words": ", Va.",
  "passive_voice": false
}

{
  "entity_1": "Handspring",
  "entity_2": "Silicon Valley",
  "confidence": 0.893486865148862,
  "sentence": "There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .",
  "bef_words": "firms like",
  "bet_words": ", a company based in",
  "aft_words": "that looks",
  "passive_voice": false
}

Snowball has several parameters to tune the extraction process, in the example above it uses the default values, but these can be set in the configuration file: parameters.cfg

max_tokens_away=6           # maximum number of tokens between the two entities
min_tokens_away=1           # minimum number of tokens between the two entities
context_window_size=2       # number of tokens to the left and right of each entity

alpha=0.2                   # weight of the BEF context in the similarity function
beta=0.6                    # weight of the BET context in the similarity function
gamma=0.2                   # weight of the AFT context in the similarity function

wUpdt=0.5                   # < 0.5 trusts new examples less on each iteration
number_iterations=3         # number of bootstrap iterations
wUnk=0.1                    # weight given to unknown extracted relationship instances
wNeg=2                      # weight given to extracted relationship instances
min_pattern_support=2       # minimum number of instances in a cluster to be considered a pattern

and passed with the argument --config=parameters.cfg.

The full command line parameters are:

  -h, --help            show this help message and exit
  --config CONFIG       file with bootstrapping configuration parameters
  --sentences SENTENCES
                        a text file with a sentence per line, and with at least two entities per sentence
  --positive_seeds POSITIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'
  --negative_seeds NEGATIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'
  --similarity SIMILARITY
                        the minimum similarity between tuples and patterns to be considered a match
  --confidence CONFIDENCE
                        the minimum confidence score for a match to be considered a true positive
  --number_iterations NUMBER_ITERATIONS
                        the number of iterations the run

In the first step it pre-processes the input file sentences.txt generating word vector representations of
relationships (i.e.: processed_tuples.pkl).

This is done so that then you can experiment with different seed examples without having to repeat the process of generating word vectors representations. Just pass the argument --sentences=processed_tuples.pkl instead to skip this generation step.

You can find more details about the original system here:

For details about this particular implementation and how it was used, please refer to the following publications:

Contributing to Snowball

Improvements, adding new features and bug fixes are welcome. If you wish to participate in the development of Snowball, please read the following guidelines.

The contribution process at a glance

  1. Preparing the development environment
  2. Code away!
  3. Continuous Integration
  4. Submit your changes by opening a pull request

Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer.

Preparing the development environment

Make sure you have Python3.9 installed on your system

macOs

brew install [email protected]
python3.9 -m pip install --user --upgrade pip
python3.9 -m pip install virtualenv

Clone the repository and prepare the development environment:

git clone [email protected]:davidsbatista/Snowball.git
cd Snowball            
python3.9 -m virtualenv venv         # create a new virtual environment for development using python3.9 
source venv/bin/activate             # activate the virtual environment
pip install -r requirements_dev.txt  # install the development requirements
pip install -e .                     # install Snowball in edit mode

Continuous Integration

Snowball runs a continuous integration (CI) on all pull requests. This means that if you open a pull request (PR), a full test suite is run on your PR:

  • The code is formatted using black and isort
  • Unused imports are auto-removed using pycln
  • Linting is done using pyling and flake8
  • Type checking is done using mypy
  • Tests are run using pytest

Nevertheless, if you prefer to run the tests & formatting locally, it's possible too.

make all

Opening a Pull Request

Every PR should be accompanied by short description of the changes, including:

  • Impact and motivation for the changes
  • Any open issues that are closed by this PR

Give a ⭐️ if this project helped you!

More Repositories

1

Annotated-Semantic-Relationships-Datasets

A collections of public and free annotated datasets of relationships between entities/nominals (Portuguese and English)
683
star
2

NER-datasets

Datasets to train supervised classifiers for Named-Entity Recognition in different languages (Portuguese, German, Dutch, French, English)
Python
337
star
3

NER-Evaluation

An implementation of a full named-entity evaluation metrics based on SemEval'13 Task 9 - not at tag/token level but considering all the tokens that are part of the named-entity
Python
213
star
4

BREDS

"Bootstrapping Relationship Extractors with Distributional Semantics" (Batista et al., 2015) in EMNLP'15 - Python implementation
Python
145
star
5

Aspect-Based-Sentiment-Analysis

Aspect-Based Sentiment Analysis Experiments
Python
133
star
6

text-classification

An example on how to train supervised classifiers for multi-label text classification using sklearn pipelines
Jupyter Notebook
110
star
7

ConvNets-for-Sentence-Classification

"Convolutional Neural Networks for Sentence Classification" (Kim 2014) - https://www.aclweb.org/anthology/D14-1181
Jupyter Notebook
54
star
8

machine-learning-notebooks

Assorted exercises and proof-of-concepts to understand and study machine learning and statistical learning theory
Jupyter Notebook
44
star
9

lexicons

Dictionaries of names, surnames, acronyms and it's extensions, stop-words, etc., which I gathered for different experiments.
29
star
10

TAC-Entity-Linking

An entity linking prototype, developed using the datasets from the TAC-KBP sub-task.
Java
28
star
11

awesome-Portuguese-NLP

A list of libraries and NLP projects for Portuguese
19
star
12

information-extraction-PT

An example of triples extraction with PoS-tags using ReVerb
Python
16
star
13

REACTION-resources

Resources developed by and for the project REACTION (Retrieval, Extraction and Aggregation Computing Technology for Integrating and Organizing News) an initiative for developing a computational journalism platform (mostly) for Portuguese.
9
star
14

StanfordNER-experiments

Python
8
star
15

SLANG-Sequence-LAbeliNG

Sequence LAbeliNG with Neural Networks: "Neural Architectures for Named Entity Recognition" (Lample et al., 2016) and "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (Ma, 2016)
Jupyter Notebook
7
star
16

Toponym-Disambiguation-Using-Ontology-Based-Semantic-Similarity

Toponym Disambiguation using Ontology-based Semantic Similarity.
Python
6
star
17

coding-exercises

A repository of coding interview questions and solutions
Python
6
star
18

MuSICo

A Minwise Hashing Method for Addressing Relationship Extraction from Text
Java
5
star
19

bash-shell-utils

bash scripts, sed examples, and other stuff that I need from time to time
Shell
4
star
20

Temporal-Information-Datasets

3
star
21

minhash-classifier

supervised relationship extraction based on min-hash and locality sensitive hashing
Python
3
star
22

NER-English-Gigaword-LDC

Python scripts to parse the Gigaword collection and perform NER tagging with StanfordNER
Python
3
star
23

dbpedia-webapps

Simple webapps, relying on DBpedia as a data-source.
Python
3
star
24

GermEval-2019-Task_1

GermEval 2019 Task 1 - Shared Task on Hierarchical Classification of Blurbs
Python
2
star
25

ml-report-kit

A plug-in to generate various evaluation metrics and reports ( PR-curves, classifications reports, confusion matrix) for supervised machine learning models using only two lines of code.
Python
2
star
26

nostalgia

Old projects of mine, done during high-school or university and found in old hard-drives
HTML
1
star
27

politiquices

Explore relações de apoio e oposição, entre personalidades políticas, expressas em títulos de notícias preservadas no arquivo.pt
1
star
28

Snowball-Java

Snowball: Extracting Relations from Large Plain-Text Collections
Java
1
star
29

GermEval-2017-Aspect-Based-Sentiment-Analysis

Python
1
star
30

davidsbatista.net

my personal homepage and blog
HTML
1
star