Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
Usage
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is also listed in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
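Where wget is not available, the same download can be sketched in plain Python. This is a minimal illustration using only the standard library; the `download` helper and its `.part` temporary-file naming are assumptions made for this example and are not part of corus.

```python
import shutil
import urllib.request
from pathlib import Path

URL = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'


def download(url, path, chunk_size=1024 * 1024):
    """Stream url into path without holding the whole file in memory."""
    path = Path(path)
    if path.exists():
        # The archive is already on disk, so skip the download.
        return path
    partial = path.with_name(path.name + '.part')
    with urllib.request.urlopen(url) as response, open(str(partial), 'wb') as file:
        shutil.copyfileobj(response, file, chunk_size)
    # Rename only after the copy has finished, so an interrupted transfer
    # never leaves a truncated file under the final name.
    partial.rename(path)
    return path
```

Calling `download(URL, 'lenta-ru-news.csv.gz')` should leave the archive in the current directory under the same name the wget command produces.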
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
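Because the loader is consumed with `next` and a `for` loop, it can be processed lazily, which matters for multi-gigabyte dumps. The sketch below counts records per topic while reading only the first few records. `count_topics` is a hypothetical helper, and `Record` is a stand-in with the same fields as the `LentaRecord` shown above, so the example runs without the archive.

```python
from collections import Counter, namedtuple
from itertools import islice


def count_topics(records, limit=None):
    """Count records per topic, reading at most `limit` records."""
    counts = Counter()
    # islice stops after `limit` items; limit=None reads everything.
    for record in islice(records, limit):
        counts[record.topic] += 1
    return counts


# Hypothetical stand-in for LentaRecord (same field names as in the output above).
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])
sample = iter([
    Record('u1', 't1', 'text one', 'Россия', 'Общество'),
    Record('u2', 't2', 'text two', 'Мир', 'Политика'),
    Record('u3', 't3', 'text three', 'Россия', 'Общество'),
])
print(count_topics(sample, limit=2))
```

With corus installed and the archive downloaded, `count_topics(load_lenta(path), limit=10000)` would apply the same counting to real records.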
For links to other datasets and their loaders see the Reference section.
Documentation
Materials are in Russian:
Install
corus supports Python 3.5+, PyPy 3.
$ pip install corus
Reference
Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description |
---|---|---|---|---|---|
Lenta.ru | |||||
Lenta.ru v1.0 |
load_lenta
#
|
news
|
739 351 | 1.66 Gb |
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
|
Lenta.ru v1.1+ |
load_lenta2
#
|
news
|
800 975 | 1.94 Gb |
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2
|
Lib.rus.ec |
load_librusec
#
|
fiction
|
301 871 | 144.92 Gb |
Dump of lib.rus.ec prepared for RUSSE workshop
wget http://panchenko.me/data/russe/librusec_fb2.plain.gz
|
Rossiya Segodnya |
load_ria_raw
#
load_ria
#
|
news
|
1 003 869 | 3.70 Gb |
wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
|
Mokoron Russian Twitter Corpus |
load_mokoron
#
|
social
sentiment
|
17 633 417 | 1.86 Gb |
Russian Twitter sentiment markup
Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
Wikipedia |
load_wiki
#
|
1 541 401 | 12.94 Gb |
Russian Wiki dump
wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
|
|
GramEval2020 |
load_gramru
#
|
162 372 | 30.04 Mb |
wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip
unzip master.zip
mv GramEval2020-master/dataTrain train
mv GramEval2020-master/dataOpenTest dev
rm -r master.zip GramEval2020-master
wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu
|
|
OpenCorpora |
load_corpora
#
|
morph
|
4 030 | 20.21 Mb |
wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip
|
RusVectores SimLex-965 |
load_simlex
#
|
emb
sim
|
wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv
wget https://rusvectores.org/static/testsets/ru_simlex965.tsv
|
||
Omnia Russica |
load_omnia
#
|
morph
web
fiction
|
489.62 Gb |
Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf
Manually download http://bit.ly/2ZT4BY9 |
|
factRuEval-2016 |
load_factru
#
|
ner
news
|
254 | 969.27 Kb |
Manual PER, LOC, ORG markup prepared for 2016 Dialog competition
wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip
unzip master.zip
rm master.zip
|
Gareev |
load_gareev
#
|
ner
news
|
97 | 455.02 Kb |
Manual PER, ORG markup (no LOC)
Email Rinat Gareev ([email protected]) and ask for the dataset
tar -xvf rus-ner-news-corpus.iob.tar.gz
rm rus-ner-news-corpus.iob.tar.gz
|
Collection5 |
load_ne5
#
|
ner
news
|
1 000 | 2.96 Mb |
News articles with manual PER, LOC, ORG markup
wget http://www.labinform.ru/pub/named_entities/collection5.zip
unzip collection5.zip
rm collection5.zip
|
WiNER |
load_wikiner
#
|
ner
|
203 287 | 36.15 Mb |
Sentences from Wiki auto annotated with PER, LOC, ORG tags
wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2
|
BSNLP-2019 |
load_bsnlp
#
|
ner
|
464 | 1.16 Mb |
Markup prepared for 2019 BSNLP Shared Task
wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip
wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip
unzip TRAININGDATA_BSNLP_2019_shared_task.zip
unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg
rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip
|
Persons-1000 |
load_persons
#
|
ner
news
|
1 000 | 2.96 Mb |
Same as Collection5, only PER markup + normalized names
wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip
|
The Russian Drug Reaction Corpus (RuDReC) |
load_rudrec
#
|
ner
|
4 809 | 1.73 Kb |
RuDReC is a new, partially annotated corpus of consumer reviews in Russian about pharmaceutical products, built for detecting health-related named entities and the effectiveness of pharmaceutical products. The loader works with the annotated part; for the raw part (1.4M reviews), see https://github.com/cimm-kzn/RuDReC.
wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json
|
Taiga |
Large collection of Russian texts from various sources: news sites, magazines, literature, social networks
wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz
tar -xzvf retagged_taiga.tar.gz
|
||||
Arzamas |
load_taiga_arzamas
#
|
news
|
311 | 4.50 Mb | |
Fontanka |
load_taiga_fontanka
#
|
news
|
342 683 | 786.23 Mb | |
Interfax |
load_taiga_interfax
#
|
news
|
46 429 | 77.55 Mb | |
KP |
load_taiga_kp
#
|
news
|
45 503 | 61.79 Mb | |
Lenta |
load_taiga_lenta
#
|
news
|
36 446 | 95.15 Mb | |
Taiga/N+1 |
load_taiga_nplus1
#
|
news
|
7 696 | 24.96 Mb | |
Magazines |
load_taiga_magazines
#
|
39 890 | 2.19 Gb | ||
Subtitles |
load_taiga_subtitles
#
|
19 011 | 909.08 Mb | ||
Social |
load_taiga_social
#
|
social
|
1 876 442 | 648.18 Mb | |
Proza |
load_taiga_proza
#
|
fiction
|
1 732 434 | 38.25 Gb | |
Stihi |
load_taiga_stihi
#
|
9 157 686 | 12.80 Gb | ||
Russian NLP Datasets | Several Russian news datasets from webhose.io, lenta.ru and other news sites. | ||||
News |
load_buriy_news
#
|
news
|
2 154 801 | 6.84 Gb |
Dump of top 40 news + 20 fashion news sites.
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2
|
Webhose |
load_buriy_webhose
#
|
news
|
285 965 | 859.32 Mb |
Dump from webhose.io, 300 sources for one month.
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2
|
ODS #proj_news_viz | Several news sites scraped by members of #proj_news_viz ODS project. | ||||
Interfax |
load_ods_interfax
#
|
news
|
543 961 | 1.22 Gb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz
|
Gazeta |
load_ods_gazeta
#
|
news
|
865 847 | 1.63 Gb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz
|
Izvestia |
load_ods_izvestia
#
|
news
|
86 601 | 307.19 Mb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz
|
Meduza |
load_ods_meduza
#
|
news
|
71 806 | 270.11 Mb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz
|
RIA |
load_ods_ria
#
|
news
|
101 543 | 233.88 Mb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz
|
Russia Today |
load_ods_rt
#
|
news
|
106 644 | 187.12 Mb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz
|
TASS |
load_ods_tass
#
|
news
|
1 135 635 | 3.27 Gb |
wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz
|
Universal Dependencies | |||||
GSD |
load_ud_gsd
#
|
morph
syntax
|
5 030 | 1.01 Mb |
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu
|
Taiga |
load_ud_taiga
#
|
morph
syntax
|
3 264 | 353.80 Kb |
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu
|
PUD |
load_ud_pud
#
|
morph
syntax
|
1 000 | 207.78 Kb |
wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu
|
SynTagRus |
load_ud_syntag
#
|
morph
syntax
|
61 889 | 11.33 Mb |
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu
|
morphoRuEval-2017 | |||||
General Internet-Corpus |
load_morphoru_gicrya
#
|
morph
|
83 148 | 10.58 Mb |
wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip
unzip GIKRYA_texts_new.zip
rm GIKRYA_texts_new.zip
|
Russian National Corpus |
load_morphoru_rnc
#
|
morph
|
98 892 | 12.71 Mb |
wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar
unrar x RNC_texts.rar
rm RNC_texts.rar
|
OpenCorpora |
load_morphoru_corpora
#
|
morph
|
38 510 | 4.80 Mb |
wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar
unrar x OpenCorpora_Texts.rar
rm OpenCorpora_Texts.rar
|
RUSSE Russian Semantic Relatedness | |||||
HJ: Human Judgements of Word Pairs |
load_russe_hj
#
|
emb
sim
|
wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv
|
||
RT: Synonyms and Hypernyms from the Thesaurus RuThes |
load_russe_rt
#
|
emb
sim
|
wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv
|
||
AE: Cognitive Associations from the Sociation.org Experiment |
load_russe_ae
#
|
emb
sim
|
wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv
wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv
wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv
|
||
Toloka Datasets | |||||
Lexical Relations from the Wisdom of the Crowd (LRWC) |
load_toloka_lrwc
#
|
emb
sim
|
wget https://tlk.s3.yandex.net/dataset/LRWC.zip
unzip LRWC.zip
rm LRWC.zip
|
||
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) |
load_ruadrect
#
|
social
|
9 515 | 2.09 Mb |
This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020
wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip
unzip RuADReCT.zip
rm RuADReCT.zip
|
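The table pairs each downloaded file with a loader name, so the right loader can be picked from the file name alone. The sketch below copies a few single-file entries from the table. `LOADERS`, `loader_name` and `load` are hypothetical helpers, and the sketch assumes that every listed loader accepts the path to the downloaded file, as `load_lenta` does in the Usage section.

```python
import os

# File name -> loader name, copied from the Reference table above.
LOADERS = {
    'lenta-ru-news.csv.gz': 'load_lenta',
    'lenta-ru-news.csv.bz2': 'load_lenta2',
    'librusec_fb2.plain.gz': 'load_librusec',
    'ria.json.gz': 'load_ria',
    'interfax.csv.gz': 'load_ods_interfax',
    'gazeta.csv.gz': 'load_ods_gazeta',
    'iz.csv.gz': 'load_ods_izvestia',
    'meduza.csv.gz': 'load_ods_meduza',
    'ria.csv.gz': 'load_ods_ria',
    'rt.csv.gz': 'load_ods_rt',
    'tass-001.csv.gz': 'load_ods_tass',
}


def loader_name(path):
    """Return the name of the corus loader listed for this file."""
    name = os.path.basename(path)
    try:
        return LOADERS[name]
    except KeyError:
        raise ValueError('no loader known for %r' % name)


def load(path):
    # Imported lazily so that loader_name() works without corus installed.
    import corus
    # Assumption: the loader takes the file path as its only required argument.
    return getattr(corus, loader_name(path))(path)
```

Datasets that unpack into a directory (for example Taiga or factRuEval-2016) are left out on purpose, since the table does not say which path their loaders expect.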
Support
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru
Add new source
- Implement corus/sources/<source>.py
- Add import into corus/sources/__init__.py
- Add meta into corus/sources/meta.py
- Add example into docs.ipynb (check that the meta table is correct)
- Run tests (readme is updated)
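As a rough idea of what the first step involves, a source module pairs a record type with a loader that yields records lazily, in the style of `load_lenta`. The sketch below is standalone and uses only the standard library. `ExampleRecord`, `load_example` and the `url`, `title`, `text` column names are hypothetical; a real source should follow the conventions of the existing modules in corus/sources, which this README does not show.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; real field names depend on the dataset.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one ExampleRecord per row of a gzipped CSV file."""
    # Text mode with newline='' lets the csv module handle quoted line breaks.
    with gzip.open(path, 'rt', encoding='utf8', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Yielding records one at a time keeps memory use flat, which matters for the larger dumps listed in the Reference section.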
Development
Dev env
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
Lint + update docs
make lint
make exec-docs
Release
# Update setup.py version
git commit -am 'Up version'
git tag v0.9.0
git push
git push --tags