116 stars · Jupyter Notebook · Apache License 2.0 · Created almost 5 years ago · Updated about 1 year ago



thai2transformers

Pretraining transformer-based Thai language models


thai2transformers provides customized scripts to pretrain transformer-based masked language models on Thai texts with the following token types:

  • spm: subword-level tokens from the SentencePiece library.
  • newmm: a dictionary-based Thai word tokenizer based on maximal matching from PyThaiNLP.
  • syllable: a dictionary-based Thai syllable tokenizer based on maximal matching from PyThaiNLP. The list of syllables used is from pythainlp/corpus/syllables_th.txt.
  • sefr: an ML-based Thai word tokenizer based on Stacked Ensemble Filter and Refine (SEFR) [Limkonchotiwat et al., 2020], which refines probabilities from the CNN-based deepcut tokenizer; the SEFR tokenizer is loaded with engine="best".
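The dictionary-based tokenizers (newmm and syllable) both rely on maximal matching against a word or syllable list. As a rough illustration only (a greedy longest-match simplification, not PyThaiNLP's actual algorithm, which resolves segmentation ambiguity more carefully):

```python
def maximal_match(text, vocab):
    """Greedy longest-match segmentation over a dictionary.

    At each position, take the longest dictionary entry that
    matches; fall back to a single character for unknown spans.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

# With a word list this segments at the word level:
print(maximal_match("ภาษาไทย", {"ภาษา", "ไทย"}))  # ['ภาษา', 'ไทย']
```

Swapping the word dictionary for a syllable list (such as syllables_th.txt) turns the same procedure into a syllable-level tokenizer.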


Thai texts for language model pretraining


We curate a list of sources that can be used to pretrain a language model. The statistics for each data source are listed on this page.

Also, you can download the current version of the cleaned datasets from here.
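The exact cleaning rules used for the released datasets are described in the pages linked above. As a hypothetical sketch of the kind of normalization such a pipeline typically performs (none of this is taken from the repository's code): unescape HTML entities, strip residual markup, and collapse whitespace.

```python
import html
import re

def clean_text(text):
    """Minimal text-cleaning sketch (illustrative, not the repo's
    actual pipeline)."""
    text = html.unescape(text)            # e.g. &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)  # drop residual HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse whitespace runs
    return text.strip()

print(clean_text("<p>สวัสดี&amp;  ครับ</p>"))  # สวัสดี& ครับ
```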



Model pretraining and finetuning instructions:


a) Instructions for RoBERTa BASE model pretraining on a Thai Wikipedia dump:

In this example, we demonstrate how to pretrain a RoBERTa BASE model on a Thai Wikipedia dump from scratch.

  1. Install required libraries: 1_installation.md

  2. Prepare thwiki dataset from Thai Wikipedia dump: 2_thwiki_data-preparation.md

  3. Tokenizer training and vocabulary building:

    a) For SentencePiece BPE (spm), word-level tokens (newmm), and syllable-level tokens (syllable): 3_train_tokenizer.md

    b) For word-level tokens from Limkonchotiwat et al., 2020 (sefr-cut): 3b_sefr-cut_pretokenize.md

  4. Pretrain a masked language model: 4_run_mlm.md
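Step 4 pretrains the model with the masked language modeling objective. At its core is the standard BERT/RoBERTa masking rule: roughly 15% of positions are selected; of those, 80% become the mask token, 10% become a random token, and 10% are left unchanged, with the loss computed only at the selected positions. A minimal sketch (illustrative only, not the repository's exact implementation):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT/RoBERTa-style masking sketch.

    Returns (inputs, labels): labels hold the original id at masked
    positions and -100 elsewhere (the ignore index most MLM losses use).
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:       # select ~15% of positions
            labels[i] = tok
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                inputs[i] = mask_id
            elif r < 0.9:                 # 10%: replace with random token
                inputs[i] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token
    return inputs, labels
```

Because the masking is sampled per call ("dynamic masking"), each epoch sees different masked positions for the same sentence, which is one of RoBERTa's changes over the original BERT recipe.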


b) Instructions for RoBERTa model finetuning on existing Thai text classification and NER/POS tagging datasets.

In this example, we demonstrate how to finetune WangchanBERTa, a RoBERTa BASE model pretrained on a Thai Wikipedia dump and assorted Thai texts.

  • Finetune the model for sequence classification tasks on existing datasets, including wisesight_sentiment, wongnai_reviews, generated_reviews_enth (review star prediction), and prachathai67k: 5a_finetune_sequence_classificaition.md

  • Finetune the model for token classification tasks (NER and POS tagging) on existing datasets, including thainer and lst20: 5b_finetune_token_classificaition.md



BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
