awesome-ukrainian-nlp
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
1. Datasets / Corpora
Monolingual
- Brown-UK - carefully curated corpus of modern Ukrainian with disambiguated tokens, 1 million words.
- UberText 2.0 - over 5 GB of news, Wikipedia, social, fiction, and legal texts.
- Wikipedia
- OSCAR - shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB deduplicated.
- CC-100 - documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.
- mC4 - another filtered Common Crawl corpus, with 196 GB of Ukrainian text.
- Ukrainian Twitter corpus - for toxic text detection.
- Ukrainian forums - 250k sentences scraped from forums.
- Ukrainian news headlines - 5.2M news headlines.
Parallel
- OPUS
- Tatoeba MT Challenge data sets
- Polish-Ukrainian Parallel Corpus
- Back-translated monolingual Wiki data
- Wiki Edits - 5M sentence edits extracted from the Ukrainian Wikipedia revision history.
See Helsinki-NLP/UkrainianLT for more data and links to machine translation resources.
Labeled
- UA-GEC - grammatical error correction (GEC) and fluency corpus.
- NER-uk - Brown-UK labeled for named entities.
- Yakaboo Book Reviews - book reviews, ratings and descriptions.
- Universal Dependencies - dependency trees corpus.
- ua-news - 150k news articles in 5 categories.
- UA-SQuAD - Ukrainian version of the Stanford Question Answering Dataset.
- Ukrainian Winograd schema challenge (WSC) Dataset - manually translated.
- Ukrainian OntoNotes Dataset - scripts to build a large silver dataset for coreference resolution.
Dictionaries
- ВЕСУМ - POS tag dictionary. Can generate a list of all word forms valid for spelling.
- Tonal dictionary
- Multilingualsentiment, includes Ukrainian - a list of positive/negative words
- obscene-ukr - profanity dictionary
- Word stress dictionary - word stress for 2.7M word forms. See ukrainian-word-stress
- Heteronyms - words that share the same spelling but have different meaning/pronunciation.
- Abbreviations - maps each abbreviation to its expansion
2. Tools
- tree_stem - stemmer
- pymorphy2 + pymorphy2-dicts-uk - POS tagger and lemmatizer
- LanguageTool - grammar, style and spell checker
- Stanza - Python package for tokenization, multi-word tokenization, lemmatization, POS, dependency parsing, NER
- nlp-uk - tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
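
As a rough illustration of how one of the tools above is typically called, the sketch below runs Stanza's default Ukrainian pipeline and collects token, lemma and POS tag for each word. It assumes `pip install stanza` and network access for the one-time model download. The helper names `analyze` and `format_rows` are made up for this example and are not part of Stanza.

```python
def analyze(text):
    """Return (token, lemma, UPOS) triples for a Ukrainian text.

    Assumption: the `stanza` package is installed; models are
    downloaded on first use.
    """
    import stanza  # imported lazily so format_rows works without it

    stanza.download("uk")        # fetch the default Ukrainian models
    nlp = stanza.Pipeline("uk")  # default processors for Ukrainian
    doc = nlp(text)
    return [(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words]


def format_rows(rows):
    """Render triples as tab-separated lines; missing values become '_'."""
    return "\n".join("\t".join(x or "_" for x in row) for row in rows)
```

Calling `print(format_rows(analyze("Київ є столицею України.")))` would print one word per line. The exact lemmas and tags depend on the model version.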
3. Pretrained models
Language models
Masked:
- xlm-roberta-base-uk - truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
- youscan/ukr-roberta-base
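
A masked model such as the ones above can be probed with the Hugging Face `transformers` fill-mask pipeline. The sketch below is a minimal example, assuming `transformers` and a backend such as PyTorch are installed and that the model can be downloaded from the Hub. The `___` placeholder and the helper names are conventions invented for this example.

```python
MODEL_ID = "youscan/ukr-roberta-base"  # one of the masked models listed above


def insert_mask(template, mask_token):
    """Replace the first '___' placeholder with the model's mask token."""
    return template.replace("___", mask_token, 1)


def top_fillers(template, k=5):
    """Return the k most likely (token, score) fillers for the placeholder.

    Assumption: `transformers` is installed and MODEL_ID is reachable.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    fill = pipeline("fill-mask", model=MODEL_ID)
    sentence = insert_mask(template, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(sentence, top_k=k)]
```

For example, `top_fillers("Київ є ___ України.")` returns candidate words with their probabilities. Reading the mask token from the tokenizer avoids hard-coding `<mask>` or `[MASK]`, which differ between model families.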
Autoregressive:
- pythia-uk - mT5 fine-tuned on wiki and oasst1 for chats in Ukrainian.
- UAlpaca - Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
- XGLM - multilingual autoregressive LM; the 4.5B checkpoint includes Ukrainian.
- Tereveni-AI/GPT-2
Mixed:
Machine translation
- Helsinki-NLP / OPUS-MT models - Ukrainian to/from 25 languages.
- M2M-100 - Ukrainian to/from 100 languages.
See Helsinki-NLP/UkrainianLT for more.
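
The OPUS-MT models are published on the Hugging Face Hub under names of the form `Helsinki-NLP/opus-mt-{src}-{tgt}`, so a translation call can be sketched as below. This assumes `transformers` and `sentencepiece` are installed. Not every language pair exists as a checkpoint, and the helper names are illustrative only.

```python
def opus_mt_model_id(src, tgt):
    """Build the Hub ID for an OPUS-MT pair (the pair may not exist)."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(text, src="uk", tgt="en"):
    """Translate text with an OPUS-MT checkpoint.

    Assumption: `transformers` and `sentencepiece` are installed and the
    requested checkpoint is available for download.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    translator = pipeline("translation", model=opus_mt_model_id(src, tgt))
    return translator(text)[0]["translation_text"]
```

For instance, `translate("Доброго ранку!")` loads `Helsinki-NLP/opus-mt-uk-en` and returns the English rendering.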
Sequence-to-sequence models
Named-entity recognition (NER)
Part-of-speech tagging (POS)
Word embeddings
- fastText
  - Official fastText trained on CommonCrawl and Wiki - 157 languages, including Ukrainian.
  - Older official fastText trained on Wiki - 294 languages, including Ukrainian.
  - fastText_multilingual - 78 languages, aligned to the same vector space.
  - fasttext_uk (2023) and cbow - trained on UberText 2.0
- Word2Vec
- GloVe
- LexVec
- BPEmb: Subword Embeddings, includes Ukrainian - easy to use with Flair
- Flair - Ukrainian added in 2022.
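
A common first check on any of the embeddings above is the cosine similarity between two word vectors. The sketch below uses the official `fasttext` Python package and assumes the Ukrainian model file (named `cc.uk.300.bin` in the official release) has already been downloaded. The function names are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return float(dot / norm) if norm else 0.0


def word_similarity(model_path, word_a, word_b):
    """Compare two words using a fastText binary model.

    Assumption: the `fasttext` package is installed and model_path points
    to a downloaded .bin file such as cc.uk.300.bin.
    """
    import fasttext  # lazy import so cosine() works without the package

    model = fasttext.load_model(model_path)
    return cosine(model.get_word_vector(word_a), model.get_word_vector(word_b))
```

Because fastText builds vectors from character n-grams, `word_similarity` also works for inflected or rare word forms that never appeared in the training data.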
Other
- uk-punctcase - punctuation and case restoration model based on XLM-RoBERTa-Uk.
- punctuation_uk_bert - another punctuation and case restoration model based on bert-base-multilingual-cased.
- ukrainian-word-stress - adds word stress.
4. Paid
- LORELEI Ukrainian Representative Language Pack - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities
5. Other resources and links
- Helsinki-NLP/UkrainianLT - another collection of links to Ukrainian language tools.
- egorsmkv/speech-recognition-uk - speech recognition and text-to-speech models and datasets
- UNLP 2023 shared task - shared task (competition) in grammatical error correction for Ukrainian