# About
This is an implementation of named entity recognition (NER) using (linear-chain) conditional random fields (CRF) in Python 2.7. It uses the python-crfsuite library as its basis. By default it can handle the labels `PER`, `LOC`, `ORG` and `MISC`, but it was primarily optimized for `PER` (recognition of names of people) in German, though it should be usable for any language. Scores are expected to be a bit lower for labels other than `PER`, because the Gazetteer feature currently only handles `PER` labels. The implementation achieved an F1 score for `PER` of 0.78 on the GermEval 2014 NER corpus (note that German NER is significantly harder than English NER) and an F1 score of 0.87 (again for `PER`) on an automatically annotated Wikipedia corpus. The CRF was trained on an excerpt of that Wikipedia corpus, so the higher score there was expected, as the GermEval 2014 corpus partly differs from Wikipedia's style of language.
# Used features
The CRF implementation uses only local features (i.e. annotating `John` at the top of a document with `PER` has no influence on another `John` at the bottom of the same document).
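To make the input format concrete, here is a minimal, hypothetical sketch of how such per-token feature dicts can be fed to python-crfsuite. The feature names, the helper function and the toy data are illustrative and not the repository's actual code:

```python
# -*- coding: utf-8 -*-
import pycrfsuite

def token_features(words, i):
    # A small subset of the features listed below, computed per token.
    word = words[i]
    return {
        "starts_uppercase": word[:1].isupper(),
        "length": len(word),
        "contains_digit": any(c.isdigit() for c in word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }

words = ["Ang", "Lee", "ist", "Regisseur"]
labels = ["PER", "PER", "O", "O"]
features = [token_features(words, i) for i in range(len(words))]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(features, labels)  # one feature dict per token, one label per token
trainer.train("toy_model.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("toy_model.crfsuite")
print(tagger.tag(features))  # e.g. ['PER', 'PER', 'O', 'O']
```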
The used features are:
- Whether the word starts with an uppercase letter
- Character length of the word
- Whether the word contains any digit (0-9)
- Whether the word contains any punctuation, i.e. `. , : ; ( ) [ ] ? !`
- Whether the word contains only digits
- Whether the word contains only punctuation
- The word2vec cluster of the word (add the `-classes` flag to the word2vec tool)
- The Brown cluster of the word
- The Brown cluster bitchain of the word (i.e. the position of the word's Brown cluster in the tree of all Brown clusters, represented by a string of `1`s and `0`s)
- Whether the word is contained in a Gazetteer of person names. The Gazetteer is created by scanning through an annotated corpus and collecting all names (words labeled with `PER`) that appear more often among the person names than among all words.
- The word pattern of the word, e.g. `John` becomes `Aa+` and `DARPA` becomes `A+` (see the sketch after this list)
- The unigram rank of the word among the 1000 most common words, where the most common word gets the rank `1` (words outside the top 1000 just get a `-1`).
- The 3-character prefix of the word, i.e. `John` becomes `Joh`.
- The 3-character suffix of the word, i.e. `John` becomes `ohn`.
- The part-of-speech (POS) tag of the word as generated by the Stanford POS Tagger.
- The LDA topic (among 100 topics) of a small window (-5, +5 words) around the word. The LDA topics are generated from the same corpus that is also used to train the CRF.
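The word pattern and Gazetteer features could be sketched roughly as follows. This is a minimal illustration, not the repository's exact code; in particular, mapping digits to `9` is an assumption, and the Gazetteer criterion is one plausible reading of "appears more often among the person names than among all words":

```python
import re
from collections import Counter

def word_pattern(word):
    # Map uppercase letters to "A", lowercase to "a", digits to "9" (assumption),
    # then collapse runs of the same symbol into "<symbol>+".
    mapped = "".join("A" if c.isupper() else "a" if c.islower()
                     else "9" if c.isdigit() else c for c in word)
    return re.sub(r"(.)\1+", r"\1+", mapped)

print(word_pattern("John"))   # Aa+
print(word_pattern("DARPA"))  # A+

def build_gazetteer(tokens):
    # tokens: list of (word, label) pairs from an annotated corpus.
    # Keep words whose relative frequency among PER-labeled tokens is
    # higher than their relative frequency among all tokens.
    word_counts = Counter(word for word, _ in tokens)
    name_counts = Counter(word for word, label in tokens if label == "PER")
    total_words = float(sum(word_counts.values()))
    total_names = float(sum(name_counts.values()))
    return set(word for word, count in name_counts.items()
               if count / total_names > word_counts[word] / total_words)
```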
# Requirements

## Libraries/code
- python 2.7 (only tested on that version)
- python-crfsuite
- scikit-learn (used during testing to generate classification reports)
- shelve (should be part of Python's standard library)
- gensim (for the LDA)
- nltk (used for its wrapper of the Stanford POS Tagger)
- Stanford POS Tagger (must be downloaded and extracted somewhere)
## Corpus
A large annotated corpus is required that (a) contains one article/document per line, (b) is tokenized (e.g. by the Stanford parser) and (c) contains annotated named entities of the form `word/LABEL`.
Example (each article shortened, German):

```
Ang/PER Lee/PER ( $foreign_language ; * 23 . Oktober 1954 in Pingtung/LOC , Taiwan/LOC ) ist ein US-amerikanisch-taiwanischer Filmregisseur , Drehbuchautor und Produzent . Er ist ...
Actinium ( latinisiert von griechisch ακτίνα , aktína „ Strahl “ ) ist ein radioaktives chemisches Element mit dem Elementsymbol Ac und der Ordnungszahl 89 . Das Element ...
Anschluss ist in der Soziologie ein Fachbegriff aus der Systemtheorie von Niklas/PER Luhmann/PER und bezeichnet die in einer sozialen Begegnung auf eine Selektion der ...
```
Notice the `/PER` and `/LOC` labels. BIO codes will automatically be normalized to non-BIO codes (e.g. `B-PER` becomes `PER`, `I-LOC` becomes `LOC`).
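That normalization amounts to roughly the following sketch (not necessarily the repository's exact code):

```python
def normalize_label(label):
    # Strip BIO prefixes: "B-PER" -> "PER", "I-LOC" -> "LOC";
    # labels without a prefix are returned unchanged.
    if label.startswith("B-") or label.startswith("I-"):
        return label[2:]
    return label
```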
You will also need word2vec clusters (which can be generated from that corpus or from a different one) and Brown clusters (likewise).
Note: You can create a large corpus with annotated names of people from Wikipedia, as names in Wikipedia articles are often linked to articles about people, which are identifiable as such. There are some papers about that approach.
# Usage
- Create a large annotated corpus with the labels `PER`, `LOC`, `ORG`, `MISC` as described above under *Corpus*. You can change these labels in `config.py`, but `PER` is required.
- Generate word2vec clusters from a large corpus (I used 1000 clusters from 300-component vectors, skip-gram, min count 50, window size 10). Use the `-classes` flag of the word2vec tool to generate clusters instead of vectors. This should result in one file.
- Generate Brown clusters from a large corpus (I used 1000 clusters, min count 12). This should result in several files, including a `paths` file.
- Install all requirements, including the Stanford POS Tagger.
- Change all constants (specifically the filepaths) in `config.py` to match your settings (see the example sketch after this list). You will have to change `ARTICLES_FILEPATH` (path to your corpus file), `STANFORD_DIR` (root directory of the Stanford POS Tagger), `STANFORD_POS_JAR_FILEPATH` (filepath of the Stanford POS Tagger jar, might be different for your version), `STANFORD_MODEL_FILEPATH` (POS tagging model to use, default is `german-fast`), `W2V_CLUSTERS_FILEPATH` (filepath to your word2vec clusters), `BROWN_CLUSTERS_FILEPATH` (filepath to your Brown clusters' `paths` file), `COUNT_WINDOWS_TRAIN` (number of examples to train on, might be too many for your corpus), `COUNT_WINDOWS_TEST` (number of examples to test on, might be too many for your corpus) and `LABELS` (if you don't use PER, LOC, ORG, MISC as labels; PER however is required).
- Run `python -m preprocessing/collect_unigrams` to create lists of unigrams for your corpus. This will take around two hours, especially if your corpus is large.
- Run `python -m preprocessing/lda --dict --train` to train the LDA model. This will also take around two hours, especially if your corpus is large.
- Run `python train.py --identifier="my_experiment"` to train a CRF model with the name `my_experiment`. This will likely run for several hours (it did when tested on 20,000 example windows). Note that feature generation will be very slow on the first run, as POS tagging and (to a lesser degree) LDA tagging take a lot of time.
- Run `python test.py --identifier="my_experiment" --mycorpus` to test your trained CRF model on an excerpt of your corpus (by default on windows 0 to 4,000, while training happens on windows 4,000 to 24,000). This also requires feature generation and will therefore be slow on the first run.
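As an illustration, a `config.py` for a German setup might look roughly like this. All paths are placeholders, and the Stanford tagger directory and model filenames depend on the version you downloaded:

```python
# Example values only -- adjust everything to your system.
ARTICLES_FILEPATH = "/home/user/corpus/articles.txt"  # annotated corpus, one article per line
STANFORD_DIR = "/home/user/stanford-postagger-full-2014-08-27"
STANFORD_POS_JAR_FILEPATH = STANFORD_DIR + "/stanford-postagger.jar"
STANFORD_MODEL_FILEPATH = STANFORD_DIR + "/models/german-fast.tagger"
W2V_CLUSTERS_FILEPATH = "/home/user/clusters/w2v_1000_classes.txt"
BROWN_CLUSTERS_FILEPATH = "/home/user/clusters/brown_1000/paths"
COUNT_WINDOWS_TRAIN = 20000  # matches training on windows 4,000 to 24,000
COUNT_WINDOWS_TEST = 4000    # matches testing on windows 0 to 4,000
LABELS = ["PER", "LOC", "ORG", "MISC"]  # PER is required
```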
# Score
Results on the GermEval 2014 NER corpus:
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 1.00   | 0.98     | 23487   |
| PER         | 0.84      | 0.73   | 0.78     | 525     |
| avg / total | 0.95      | 0.96   | 0.95     | 25002   |
Note: ~1,000 tokens are missing from the table because they belonged to LOC, ORG or MISC. The CRF model was not really trained on these labels and therefore performed poorly on them; it was only properly trained on PER.
Results on an automatically annotated Wikipedia corpus (therefore some PER labels might have been wrong/missing):
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 0.98   | 0.98     | 182952  |
| PER         | 0.88      | 0.85   | 0.87     | 8854    |
| avg / total | 0.95      | 0.95   | 0.95     | 199239  |
Note: Same as above, LOC, ORG and MISC were removed from the table.
# Licence
MIT