This repo will contain a list of useful resources for Mongolian NLP. Feel free to contribute.
Datasets
DATASET
~8 hours Mongolian TTS dataset: MnTTS, created by the Inner Mongolia University, China
DATASET
LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
DATASET
LJSpeech-like Kalmyk (West Mongolian) female voice TTS dataset created from the Kalmyk Bible (2 hours)
DATASET
300 hours Kalmyk synthetic STT dataset created by a voice conversion model
- each WAV has a different text taken from Kalmyk books
- source voice is the Kalmyk Bible female TTS
- target voices are from the VCTK dataset
- an example WAV: https://twitter.com/tugstugi/status/1409111296897912835
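The TTS datasets above are described as LJSpeech-like. A minimal sketch of reading such a transcript file follows, assuming the usual LJSpeech layout of one `id|text|normalized text` row per clip; the exact file names and columns of these datasets are not stated here, and the sample rows below are made up for illustration.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Utterance(NamedTuple):
    wav_id: str           # file name stem of the clip, e.g. "sample_0001"
    text: str             # raw transcript
    normalized_text: str  # transcript with numbers etc. spelled out


def parse_ljspeech_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse pipe-delimited LJSpeech-style rows (assumed layout)."""
    utterances = []
    # QUOTE_NONE keeps quote characters inside transcripts as plain text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        wav_id = row[0]
        text = row[1] if len(row) > 1 else ""
        # Fall back to the raw text when no normalized column is present.
        normalized = row[2] if len(row) > 2 else text
        utterances.append(Utterance(wav_id, text, normalized))
    return utterances


# Made-up sample rows; a real run would pass an opened metadata file instead.
sample = io.StringIO(
    "sample_0001|Сайн байна уу?|Сайн байна уу?\n"
    "sample_0002|Баярлалаа.\n"
)
utts = parse_ljspeech_metadata(sample)
for u in utts:
    print(u.wav_id, "->", u.normalized_text)
```

With a real dataset, the same function could be given a file opened with `encoding="utf-8"`.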
DATASET
Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
- 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sport), технологи (technology), боловсрол (education) and байгал орчин (environment)
DATASET
11-11.mn government agency complaint dataset
- 80K entries in 5 categories: санал хүсэлт (suggestion), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude) and өргөдөл (petition)
DATASET
online news corpus
- 700 million words
DATASET
Digital Archive of Mongolian Newspapers 1990-1995 of the British Library
- Common Crawl Mongolian dataset
- opendata.burtgel.gov.mn
DATASET
220K Mongolian personal names
DATASET
90K Mongolian clan/family names
DATASET
192K Mongolian company names
DATASET
Names of Mongolian provinces and districts (aimags and sums)
DATASET
Names of 195 countries (with capital cities) in Mongolian
DATASET
The 250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (total 670M words / 2M unique words)
- these words could also be used as stop words
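As a rough illustration of using a frequent-word list as stop words, the sketch below filters tokens against a small set. The five words in `STOP_WORDS` are a hypothetical stand-in chosen for the example, not the contents of the actual 250-word list, and whitespace splitting is only a placeholder for a real tokenizer.

```python
from typing import Iterable, List, Set

# Hypothetical stand-in for the 250-word frequency list (illustration only).
STOP_WORDS: Set[str] = {"нь", "юм", "байна", "энэ", "ба"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


# Whitespace tokenization is a simplification for this sketch.
tokens = "Энэ нь Монгол хэлний жишээ өгүүлбэр юм".split()
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

In practice the set would be loaded from the downloaded word list, one word per line.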
DATASET
500 Mongolian abbreviations
DATASET
Mongolian NER dataset created from Mongolian politics and sport news
DATASET
Mongolian POS dataset of the National University of Mongolia
- 100k words
- used POS tagsets
DATASET
Traditional Mongolian synthetic OCR dataset created from Mongolian song lyrics and dictionary
- 80K images
- no data augmentation is applied; to augment the data, use external libraries such as albumentations
DATASET
Traditional Mongolian OCR dataset
- 164631 samples, 200 people
DATASET
Handwritten Mongolian Cyrillic Characters Database of the Mongolian University of Science and Technology
- 28x28 grayscale, 350k images
- dataset description
DATASET
Mongolian Wordnet of the National University of Mongolia
- 26875 words, 2979 glosses, 23665 synsets, 213 examples
DATASET
Mongolian Inflectional Morphology from UniMorph 4.0
- 2085 lemmas and 14592 inflections (+ morpheme segmentations)
DATASET
Mongolian Derivational Morphology from MorphyNet
- 1410 lemmas, 1629 derivations, and 229 derivational suffixes
DATASET
Multilingual Spoken Words, a multilingual keyword spotting dataset
- 2200 Mongolian keywords, 44000 audio files
- example keywords: аав, байна, бэлдэж, дүрслэх, ламын, олов, сонирхож, түүний, хаанаас, хуулиар, чиглэсэн
DATASET
Small Kalmyk text corpus
- newspaper, poetry etc.
Mongolian Text-to-Speech
PYTORCH
tugstugi/pytorch-dc-tts
DEMO
Colab online demo
DATASET
LJSpeech-like male voice dataset created from the Mongolian Bible
TF
tugstugi/Tacotron-2, a fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
DEMO
Colab online demo
DEMO
speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from a speech by Elbegdorj.
PYTORCH
Chimege TTS demo
- 1x female
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO
HMM TTS online demo of the National University of Mongolia
- 1x male and 2x female voices
DEMO
Yet another TTS online demo (possibly HMM-based) from “Мон Спийч Ай Ти” ХХК
- demo server is currently down
- 1x male and 1x female
- female voice samples
SAMPLES
Tacotron2 TTS demo samples of Ikon.MN
- 1x female (35h)
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO
HMM-based TTS online demo of the Inner Mongolia University
- 1x female
DEMO
MTL-Tacotron TTS demo samples of the Inner Mongolia University & National University of Singapore
- 1x female
TF
ttslr/MonTTS Inner Mongolian TTS training code
SAMPLES
Speech samples
DATASET SAMPLES
MonSpeech of the Inner Mongolia University
- dataset and pretrained models are not available
TF
walker-hyf/MnTTS Inner Mongolian TTS dataset and training code
SAMPLES
Speech samples
DATASET
MnTTS of the Inner Mongolia University
Pretrained Model
download link
- dataset and pretrained models are available :)
PRODUCT
NVDA/HTS screen reader developed by the Innovation Development Center for the blind
- 1x female (National University of Mongolia voice)
PYTORCH/DEMO
Kalmyk TTS demo (Kalmyk is a Mongolic language spoken in Russia)
- dataset created from the Kalmyk Bible (2 hours)
- NVIDIA/tacotron2 + NVIDIA/waveglow
PYTORCH/DEMO
Kalmyk TTS demo from Silero (Kalmyk is a Mongolic language spoken in Russia)
Mongolian Language Model
MODEL
5-gram binary LM generated by KenLM on an uncleaned 670M-word corpus
- it can be used either with mozilla/DeepSpeech:
  ./generate_trie alphabet.txt mn_5gram.binary trie
- or in tugstugi/mongolian-speech-recognition
TF/PYTORCH
tugstugi/mongolian-bert pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs
PYTORCH
bayartsogt-ya/albert-mongolian pretrained Mongolian ALBERT
PYTORCH
robertritz/NLP ULMFiT experiments
PYTORCH
huggingface.co/bayartsogt/mongolian-gpt2 Mongolian GPT-2 model
PYTORCH
huggingface.co/bayartsogt/mongolian-roberta-base Mongolian RoBERTa base model
Mongolian Speech Recognition
PYTORCH
tugstugi/mongolian-speech-recognition
DEMO
Chimege Speech Recognition
- a proprietary dataset is used
PRODUCT
Chinese and traditional Mongolian voice input from aicloud.com
DEMO
Speech recognition of the Inner Mongolia University
- seems to be non-functional
PRODUCT
Huawei Cloud ASR supports minority languages such as Mongolian, Tibetan, and Uyghur.
PRODUCT
Google Cloud Speech-to-Text
- 20% WER on a private test dataset of 3000 audio files
PYTORCH
Wav2Vec2 XLSR fine-tuned on Mongolian Common Voice
DEMO
Colab online demo
- 50% WER
PYTORCH
Wav2Vec2 XLSR trained on a Kalmyk dataset
- pretrained on 500 hours of Kalmyk TV recordings and a 1000-hour Mongolian speech recognition dataset
- fine-tuned on the 300-hour synthetic Kalmyk STT dataset created by voice conversion
- 50% WER on a private test set created from Kalmyk TV recordings; on clean voice recordings the WER should be much lower
DEMO
https://huggingface.co/tugstugi/wav2vec2-large-xlsr-53-kalmyk
TF
coqui.ai Mongolian speech recognition trained on Mongolian Common Voice
- 90.08% WER
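The WER (word error rate) figures quoted in this section compare a recognizer's output with a reference transcript. A minimal sketch of the standard definition, word-level edit distance divided by the number of reference words, is given below; it is not the evaluation script behind any of the numbers above, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one of five reference words is missing in the hypothesis.
wer = word_error_rate(
    "би өнөөдөр сургууль руу явна",
    "би өнөөдөр сургууль явна",
)
print(f"WER: {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.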
Mongolian Script
DEMO
Cyrillic to Mongolian script converter demo of the Inner Mongolia University
DEMO
Mongolian script OCR demo of the Inner Mongolia University
PYTORCH
tugstugi/bichig2cyrillic Mongolian script to Cyrillic (and back) converter
PYTORCH
tugstugi/image2bichig Traditional Mongolian OCR using CRNN
Mongolian Text Classification
TF2
sharavsambuu/mongolian-text-classification
SKLEARN/DEMO
simple SVM Colab notebook that classifies the Eduge dataset with around 91% accuracy
- the SentencePiece model from tugstugi/mongolian-bert is used as the text tokenizer
Mongolian Named Entity Recognition
DATASET
Mongolian NER dataset created from Mongolian politics and sport news
- for more info see datasets
PYTORCH
enod/mongolian-bert-ner BERT-based Mongolian NER
- uses the tugstugi/mongolian-bert pretrained Mongolian BERT models
DEMO
NER demo of the National University of Mongolia
Misc
PYTORCH
tugstugi/forced_aligner Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
DEMO
Colab online demo
TF2
Cyrillic transliteration Colab notebook sharavsambuu/cyrillic-mongolian-transliteration
DATASET
1M back-translated MN->EN sentence dataset download link
DICTIONARY
Digitized Mongolian dictionaries from the Center for Northeast Asian Studies of Tohoku University in Japan
- for usage see Digitizing the Mongolian Language: An Introduction to the Polyglot “Online Dictionaries and Full-text Search of Mongolian Languages and Written Manchu”
- also includes IPA pronunciations for Mongolian words