Repository Details

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

awesome-ukrainian-nlp

1. Datasets / Corpora

Monolingual

  • Brown-UK - carefully curated corpus of modern Ukrainian with disambiguated tokens, 1 million words.
  • UberText 2.0 - over 5 GB of news, Wikipedia, social, fiction, and legal texts.
  • Wikipedia
  • OSCAR - shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB deduplicated.
  • CC-100 - documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.
  • mC4 - another filtered Common Crawl extract, 196 GB of Ukrainian text.
  • Ukrainian Twitter corpus - Twitter corpus for toxic text detection.
  • Ukrainian forums - 250k sentences scraped from forums.
  • Ukrainian news headlines - 5.2M news headlines.

Parallel

See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.

Labeled

Dictionaries

2. Tools

  • tree_stem - stemmer
  • pymorphy2 + pymorphy2-dicts-uk - POS tagger and lemmatizer
  • LanguageTool - grammar, style and spell checker
  • Stanza - Python package for tokenization, multi-word tokenization, lemmatization, POS tagging, dependency parsing, NER
  • nlp-uk - tools for cleaning and normalizing texts, tokenization, lemmatization, POS tagging, disambiguation
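
As a rough illustration of how one of the tools above might be used, the sketch below runs Stanza's Ukrainian pipeline for tokenization, lemmatization and POS tagging. It assumes the `stanza` package is installed and that its `uk` models can be downloaded; the helper names `analyze` and `format_rows` are hypothetical and not part of any listed project.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (token, lemma, universal POS tag)


def analyze(text: str) -> List[Row]:
    """Tokenize, lemmatize and POS-tag Ukrainian text with Stanza.

    Hypothetical helper: requires the third-party `stanza` package
    and network access the first time the models are fetched.
    """
    import stanza  # imported lazily so format_rows works without it

    stanza.download("uk")  # fetches the Ukrainian models on first use
    nlp = stanza.Pipeline("uk", processors="tokenize,mwt,pos,lemma")
    doc = nlp(text)
    return [
        (word.text, word.lemma or "", word.upos or "")
        for sentence in doc.sentences
        for word in sentence.words
    ]


def format_rows(rows: List[Row]) -> str:
    """Render rows as tab-separated lines, one token per line."""
    return "\n".join("\t".join(row) for row in rows)
```

Calling `print(format_rows(analyze("Київ є столицею України.")))` should then print one token per line with its lemma and POS tag; the exact tags depend on the model version that gets downloaded.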

3. Pretrained models

Language models

Masked:

Autoregressive:

  • pythia-uk - mT5 fine-tuned on wiki and oasst1 for chats in Ukrainian.
  • UAlpaca - Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
  • XGLM - multilingual autoregressive LM; the 4.5B checkpoint includes Ukrainian.
  • Tereveni-AI/GPT-2

Mixed:

Machine translation

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

Named-entity recognition (NER)

Part-of-speech tagging (POS)

Word embeddings

Other

4. Paid

5. Other resources and links

6. Workshops and conferences