rnnmorph
Important: please see https://github.com/natasha/slovnet#morphology-1
Morphological analyzer (POS tagger) for Russian and English, based on neural networks and dictionary-lookup systems (pymorphy2, nltk).
Contacts
- Telegram: @YallenGusev
Russian language, MorphoRuEval-2017 test dataset, accuracy
Domain | Full tag | PoS tag | Full tag + lemma | Sentence full tag | Sentence full tag + lemma |
---|---|---|---|---|---|
Lenta (news) | 96.31% | 98.01% | 92.96% | 77.93% | 52.79% |
VK (social) | 95.20% | 98.04% | 92.06% | 74.30% | 60.56% |
JZ (lit.) | 95.87% | 98.71% | 90.45% | 73.10% | 43.15% |
All | 95.81% | 98.26% | N/A | 74.92% | N/A |
English language, UD EWT test, accuracy
Dataset | Full tag | PoS tag | Full tag + lemma | Sentence full tag | Sentence full tag + lemma |
---|---|---|---|---|---|
UD EWT test | 91.57% | 94.10% | 87.02% | 63.17% | 50.99% |
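
The tables report both per-token and per-sentence accuracy. The sketch below shows how the two can differ on the same predictions. It assumes that a sentence counts as correct only when every token in it is tagged correctly. The function `tagging_accuracy` and the toy data are hypothetical and are not part of rnnmorph or of the official evaluation scripts.

```python
from typing import List, Tuple


def tagging_accuracy(
    gold: List[List[str]], predicted: List[List[str]]
) -> Tuple[float, float]:
    """Return (token-level accuracy, sentence-level accuracy).

    Hypothetical helper: assumes a sentence is correct only if all of
    its tokens match the gold annotation.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    total_tokens = correct_tokens = correct_sentences = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentences must be aligned token by token")
        matches = sum(g == p for g, p in zip(gold_sent, pred_sent))
        total_tokens += len(gold_sent)
        correct_tokens += matches
        # One wrong token makes the whole sentence count as wrong.
        correct_sentences += int(matches == len(gold_sent))
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Toy data: one wrong tag out of five tokens, in the first of two sentences.
gold = [["NOUN", "VERB", "NOUN"], ["PRON", "VERB"]]
pred = [["NOUN", "VERB", "ADJ"], ["PRON", "VERB"]]
token_acc, sentence_acc = tagging_accuracy(gold, pred)
print(f"token: {token_acc:.2f}, sentence: {sentence_acc:.2f}")
```

A single token error lowers token accuracy only slightly but invalidates the whole sentence, which is why the sentence-level columns are much lower than the token-level ones.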
Speed and memory consumption
Speed: 200 to 600 words per second on a CPU.
Memory consumption: about 500-600 MB for single-sentence predictions.
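
To check throughput on your own hardware, a rough timing loop like the one below may help. It is a minimal sketch: `words_per_second` is a hypothetical helper, and `dummy_predict` is a stand-in so the snippet runs without the model. To measure rnnmorph, pass `predictor.predict` from the Usage section instead.

```python
import time
from typing import Callable, List, Sequence


def words_per_second(
    predict: Callable[[List[str]], Sequence], sentences: List[List[str]]
) -> float:
    """Time `predict` over tokenized sentences and return words per second."""
    total_words = sum(len(sentence) for sentence in sentences)
    start = time.perf_counter()
    for sentence in sentences:
        predict(sentence)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast stand-in functions.
    return total_words / elapsed if elapsed > 0 else float("inf")


def dummy_predict(words: List[str]) -> List[str]:
    """Stand-in for a real predictor; it does no tagging at all."""
    return [word.upper() for word in words]


sentences = [["мама", "мыла", "раму"]] * 1000
rate = words_per_second(dummy_predict, sentences)
print(f"{rate:.0f} words per second")
```

The result depends heavily on sentence length, hardware, and whether model loading is included, so it is best treated as a relative measure.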
Install
pip install rnnmorph
Usage
from rnnmorph.predictor import RNNMorphPredictor

# Load the Russian model.
predictor = RNNMorphPredictor(language="ru")
# Pass an already tokenized sentence; forms[i] describes the i-th word.
forms = predictor.predict(["мама", "мыла", "раму"])
print(forms[0].pos)          # NOUN
print(forms[0].tag)          # Case=Nom|Gender=Fem|Number=Sing
print(forms[0].normal_form)  # мама
print(forms[0].vector)
# [0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1]
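
The `tag` value is a single string of grammatical features. If you need individual features, it can be split into a dictionary. The sketch below assumes every tag follows the `Name=Value|Name=Value` layout shown in the output above. `parse_tag` is a hypothetical helper, not part of rnnmorph, and treating `_` as an empty tag is an assumption borrowed from the Universal Dependencies convention.

```python
from typing import Dict


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a 'Name=Value|Name=Value' tag string into a dictionary."""
    # Assumption: an empty string or "_" means "no features".
    if not tag or tag == "_":
        return {}
    features = {}
    for pair in tag.split("|"):
        name, _, value = pair.partition("=")
        features[name] = value
    return features


features = parse_tag("Case=Nom|Gender=Fem|Number=Sing")
print(features["Case"])    # Nom
print(features["Number"])  # Sing
```

With a real predictor, `parse_tag(forms[0].tag)` would give the same kind of dictionary for each word.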
Training
Acknowledgements
- Anastasyev D. G., Gusev I. O., Indenbom E. M., 2018, Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations
- Anastasyev D. G., Andrianov A. I., Indenbom E. M., 2017, Part-of-speech Tagging with Rich Language Description, presentation
- Morphological analysis track of "Dialogue-2017"
- Track materials
- Morphine by kmike, CRF classifier for MorphoRuEval-2017 by kmike
- Universal Dependencies
- Tobias Horsmann and Torsten Zesch, 2017, Do LSTMs really work so well for PoS tagging? – A replication study
- Barbara Plank, Anders Søgaard, Yoav Goldberg, 2016, Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss