  • Stars: 114
  • Rank: 308,031 (Top 7%)
  • Language: Python
  • License: Apache License 2.0
  • Created: about 4 years ago
  • Updated: about 4 years ago

Repository Details

The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020)

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks [pdf]
Kyubyong Park*, Joohong Lee*, Seongbo Jang*, Dawoon Jung*
Accepted to AACL-IJCNLP 2020. (*indicates equal contribution)

Abstract: Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model.
Even though Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method due to its simplicity and universality, it still remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies in order to answer our primary research question, that is, "What is the best tokenization strategy for Korean NLP tasks?"
Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective.

Installation

pip install -r requirements.txt

Tokenization Strategies

There are six tokenization strategies for Korean. See the repository for instructions on preparing and using each strategy; a brief sketch of the morpheme-aware subword strategy follows the list below.

  1. Consonant and Vowel
  2. Syllable
  3. Morpheme
  4. Subword
  5. Morpheme-aware Subword
  6. Word
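As an illustration of strategy 5 (the hybrid the paper finds strongest for translation and most NLU tasks), the sketch below first segments a sentence into morphemes with MeCab-ko and then applies a SentencePiece BPE model to the morpheme sequence. This is a minimal sketch rather than the repository's exact pipeline: the model path ko_bpe.model and the sample sentence are placeholders, and it assumes mecab-python3 (with a Korean dictionary such as mecab-ko-dic) and sentencepiece are installed.

import MeCab                  # mecab-python3; requires a Korean dictionary such as mecab-ko-dic
import sentencepiece as spm

SP_MODEL = "ko_bpe.model"     # placeholder: a SentencePiece BPE model trained on morpheme-segmented text
sentence = "나는 오늘 책을 읽었다"

# Step 1: morpheme segmentation with MeCab-ko ("-Owakati" prints space-separated morphemes).
tagger = MeCab.Tagger("-Owakati")
morphemes = tagger.parse(sentence).strip()   # e.g. "나 는 오늘 책 을 읽 었 다" (exact output depends on the dictionary)

# Step 2: subword (BPE) segmentation applied on top of the morpheme sequence.
sp = spm.SentencePieceProcessor(model_file=SP_MODEL)
print(sp.encode(morphemes, out_type=str))

The other strategies differ only in the segmentation step: consonant-and-vowel and syllable operate on characters, morpheme stops after the MeCab pass, plain subword skips it, and word splits on whitespace.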

The corpus used for building vocabularies and training the BPE models is the Korean Wiki corpus, extracted and refined with attardi/wikiextractor.

Korean from/to English Translation

Tokenization | Vocab Size | ko-en (Dev) | ko-en (Test) | en-ko (Dev) | en-ko (Test) | OOV Rate | Avg. Length
CV | 166 | 39.11 | 38.56 | 36.52 | 36.45 | 0.02 | 142.75
Syllable | 2K | 39.3 | 38.75 | 38.64 | 38.45 | 0.06 | 69.20
Morpheme | 8K | 31.59 | 31.24 | 32.44 | 32.19 | 7.51 | 49.19
Morpheme | 16K | 34.38 | 33.8 | 35.74 | 35.52 | 4.67 | 49.19
Morpheme | 32K | 36.19 | 35.74 | 36.51 | 36.12 | 2.72 | 49.19
Morpheme | 64K | 37.88 | 37.37 | 37.51 | 37.03 | 1.4 | 49.19
Subword | 4K | 39.18 | 38.75 | 38.31 | 38.18 | 0.07 | 48.02
Subword | 8K | 39.16 | 38.75 | 38.09 | 37.94 | 0.08 | 38.44
Subword | 16K | 39.22 | 38.77 | 37.64 | 37.34 | 0.1 | 33.69
Subword | 32K | 39.05 | 38.69 | 37.11 | 36.98 | 0.11 | 30.21
Subword | 64K | 37.02 | 36.46 | 35.77 | 35.64 | 0.12 | 27.50
Morpheme-aware Subword | 4K | 39.41 | 38.95 | 39.29 | 39.13 | 0.06 | 65.17
Morpheme-aware Subword | 8K | 39.42 | 39.06 | 39.78 | 39.61 | 0.06 | 56.79
Morpheme-aware Subword | 16K | 39.84 | 39.41 | 40.23 | 40.04 | 0.07 | 53.30
Morpheme-aware Subword | 32K | 41.00 | 40.34 | 40.43 | 40.41 | 0.07 | 51.38
Morpheme-aware Subword | 64K | 39.62 | 39.34 | 38.63 | 38.42 | 0.07 | 50.27
Word | 64K | 7.04 | 7.07 | 18.68 | 18.42 | 26.2 | 18.96

Dataset

A Korean-English parallel corpus was recently released by AI Hub, gathered from various sources such as news, government websites, and legal documents. We downloaded the news data, which amounts to 800K sentence pairs, and randomly split it into 784K (train), 8K (dev), and 8K (test) pairs.
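The split itself is straightforward; below is a minimal sketch of the random 784K/8K/8K split, assuming the downloaded news data has been exported to line-aligned files named train-all.ko and train-all.en (hypothetical file names, not from the repository).

import random

# Hypothetical file names: line-aligned Korean/English sentence pairs from the AI Hub news data.
with open("train-all.ko", encoding="utf-8") as f_ko, open("train-all.en", encoding="utf-8") as f_en:
    pairs = list(zip(f_ko.read().splitlines(), f_en.read().splitlines()))

random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(pairs)

splits = {"dev": pairs[:8000], "test": pairs[8000:16000], "train": pairs[16000:]}
for name, subset in splits.items():
    with open(f"{name}.ko", "w", encoding="utf-8") as f_ko, open(f"{name}.en", "w", encoding="utf-8") as f_en:
        f_ko.write("\n".join(ko for ko, _ in subset) + "\n")
        f_en.write("\n".join(en for _, en in subset) + "\n")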

Training & Evaluation

We ran all experiments using pytorch/fairseq (Ott et al., 2019), a PyTorch-based toolkit for sequence-to-sequence modeling.

1. Preprocess

fairseq-preprocess \
  --source-lang ko \
  --target-lang en \
  --trainpref ./dataset/translation/mecab_sp-8k/train \
  --validpref ./dataset/translation/mecab_sp-8k/dev \
  --testpref ./dataset/translation/mecab_sp-8k/test \
  --destdir ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
  --srcdict ./resources/mecab_sp-8k/fairseq.vocab \
  --tgtdict ./resources/en_sp-32k/fairseq.vocab

2. Training

We used the Transformer (Vaswani et al., 2017), the state-of-the-art model for neural machine translation. We mostly followed the base configuration: 6 encoder and 6 decoder blocks with model dimension 512, feed-forward dimension 2048, and 8 attention heads.

fairseq-train ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
--arch transformer \
--share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-epoch 50 \
--batch-size 128 \
--save-dir translation_ckpt/mecab_sp-8k/ko-en \
--disable-validation

3. Evaluation

We report BLEU scores on both the dev and test sets using the Moses multi-bleu.perl script. Following WAT 2019 (Nakazawa et al., 2019), the Moses tokenizer and MeCab-ko are used to tokenize the evaluation data.

fairseq-generate ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
  --path translation_ckpt/mecab_sp-8k/ko-en/checkpoint_best.pt \
  --batch-size 512 \
  --beam 5 \
  --remove-bpe sentencepiece
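multi-bleu.perl expects tokenized plain-text hypothesis and reference files, so the generation output has to be post-processed first. Below is a minimal sketch for the en→ko direction, assuming the fairseq-generate output was redirected to a file gen.out (hypothetical name): it pulls out the hypothesis (H-) and target (T-) lines and re-tokenizes them with MeCab-ko; ko→en outputs would be tokenized with the Moses tokenizer instead.

import MeCab  # mecab-python3 with a Korean dictionary; used to re-tokenize Korean outputs

tagger = MeCab.Tagger("-Owakati")

def tokenize(text):
    return tagger.parse(text).strip()

hyps, refs = [], []
with open("gen.out", encoding="utf-8") as f:          # hypothetical: fairseq-generate stdout saved to a file
    for line in f:
        if line.startswith("H-"):                     # hypothesis lines: "H-<id>\t<score>\t<text>"
            hyps.append(tokenize(line.rstrip("\n").split("\t")[2]))
        elif line.startswith("T-"):                   # reference lines: "T-<id>\t<text>"
            refs.append(tokenize(line.rstrip("\n").split("\t")[1]))

with open("hyp.tok", "w", encoding="utf-8") as f:
    f.write("\n".join(hyps) + "\n")
with open("ref.tok", "w", encoding="utf-8") as f:
    f.write("\n".join(refs) + "\n")

# Then score with: perl multi-bleu.perl ref.tok < hyp.tok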

Korean Natural Language Understanding

Tokenization | Vocab Size | KorQuAD (EM / F1) | KorNLI | KorSTS | NSMC | PAWS-X
CV | 166 | 59.66 / 73.91 | 70.6 | 71.2 | 77.22 | 71.47
Syllable | 2K | 69.10 / 83.29 | 73.98 | 73.47 | 82.7 | 75.86
Morpheme | 32K | 68.05 / 83.82 | 74.86 | 74.37 | 82.37 | 76.83
Morpheme | 64K | 70.68 / 85.25 | 75.06 | 75.69 | 83.21 | 77.38
Subword | 4K | 71.48 / 83.11 | 74.38 | 74.03 | 83.37 | 76.8
Subword | 8K | 72.91 / 85.11 | 74.18 | 74.65 | 83.23 | 76.42
Subword | 16K | 73.42 / 85.75 | 74.46 | 75.15 | 83.3 | 76.41
Subword | 32K | 74.04 / 86.30 | 74.74 | 74.29 | 83.02 | 77.01
Subword | 64K | 74.04 / 86.66 | 73.73 | 74.55 | 83.52 | 77.47
Morpheme-aware Subword | 4K | 67.53 / 81.93 | 73.53 | 73.45 | 83.34 | 76.03
Morpheme-aware Subword | 8K | 70.90 / 84.57 | 74.14 | 73.95 | 83.71 | 76.07
Morpheme-aware Subword | 16K | 69.47 / 83.36 | 75.02 | 74.99 | 83.22 | 76.59
Morpheme-aware Subword | 32K | 72.65 / 86.35 | 74.1 | 75.13 | 83.65 | 78.11
Morpheme-aware Subword | 64K | 69.48 / 83.73 | 76.39 | 76.61 | 84.29 | 76.78
Word | 64K | 1.54 / 8.86 | 64.06 | 65.83 | 69 | 60.41

Pre-training

For each tokenization strategy, we pre-trained a BERT-Base model (Devlin et al., 2019) on a Cloud TPU v3-8 for 1M steps using the official google-research/bert code.

We set the training hyper-parameters of all models as follows: batch_size=1024, max_sequence_length=128, learning_rate=5e-5, warm_up_steps=10000.

Because the Korean Wiki corpus (640 MB) is not large enough for pre-training, we additionally downloaded a recent dump of Namuwiki (5.5 GB) and extracted plain text from it using Namu Wiki Extractor.

Fine-tuning

After converting each pre-trained TensorFlow model into PyTorch, we fine-tuned the models using huggingface/transformers (Wolf et al., 2019).
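The conversion step is not spelled out in this README; the sketch below uses the TensorFlow-checkpoint loader that ships with huggingface/transformers (it requires TensorFlow to be installed). The checkpoint and config paths are placeholders.

from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths to the TensorFlow pre-training artifacts.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt-1000000")  # TF checkpoint prefix
model.save_pretrained("pytorch_bert")                         # writes the PyTorch weights and config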

Example:

python tasks/<TASK_NAME>/run_train.py --tokenizer <TOKENIZER_NAME>

Citation

@article{park2020empirical,
  title={An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks},
  author={Park, Kyubyong and Lee, Joohong and Jang, Seongbo and Jung, Dawoon},
  journal={arXiv preprint arXiv:2010.02534},
  year={2020}
}

Acknowledgements

For pre-training the models, we used Cloud TPUs provided by the TensorFlow Research Cloud program.

More Repositories

  1. fast-autoaugment — Official Implementation of 'Fast AutoAugment' in PyTorch. (Python, 1,587 stars)
  2. nerf-factory — An awesome PyTorch NeRF library. (Python, 1,265 stars)
  3. pororo — PORORO: Platform Of neuRal mOdels for natuRal language prOcessing. (Python, 1,252 stars)
  4. coyo-dataset — COYO-700M: Large-scale Image-Text Pair Dataset. (Python, 1,062 stars)
  5. kogpt — KakaoBrain KoGPT (Korean Generative Pre-trained Transformer). (Python, 1,000 stars)
  6. torchgpipe — A GPipe implementation in PyTorch. (Python, 776 stars)
  7. karlo — (Python, 679 stars)
  8. rq-vae-transformer — The official implementation of Autoregressive Image Generation using Residual Quantization (CVPR '22). (Jupyter Notebook, 669 stars)
  9. mindall-e — PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs. (Python, 630 stars)
  10. word2word — Easy-to-use word-to-word translations for 3,564 language pairs. (Python, 350 stars)
  11. torchlars — A LARS implementation in PyTorch. (Python, 326 stars)
  12. g2pm — A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset. (Python, 326 stars)
  13. kor-nlu-datasets — KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding. (283 stars)
  14. trident — A performance library for machine learning applications. (Python, 176 stars)
  15. autoclint — A specially designed light version of Fast AutoAugment. (Python, 170 stars)
  16. sparse-detr — PyTorch Implementation of Sparse DETR. (Python, 150 stars)
  17. hotr — Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation). (Python, 132 stars)
  18. bassl — (Python, 113 stars)
  19. scrl — PyTorch Implementation of Spatially Consistent Representation Learning (SCRL). (Python, 108 stars)
  20. flame — Official implementation of the paper "FLAME: Free-form Language-based Motion Synthesis & Editing". (Python, 103 stars)
  21. brain-agent — Brain Agent for Large-Scale and Multi-Task Agent Learning. (Python, 92 stars)
  22. helo-word — Team Kakao&Brain's Grammatical Error Correction System for the ACL 2019 BEA Shared Task. (Python, 88 stars)
  23. jejueo — Jejueo Datasets for Machine Translation and Speech Synthesis. (Python, 74 stars)
  24. solvent — (Python, 66 stars)
  25. noc — (Jupyter Notebook, 44 stars)
  26. cxr-clip — (Python, 43 stars)
  27. expgan — (Python, 41 stars)
  28. autowu — Official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th ICML Workshop on AutoML). (Python, 39 stars)
  29. nvs-adapter — (Python, 33 stars)
  30. ginr-ipc — The official implementation of Generalizable Implicit Neural Representations with Instance Pattern Composers (CVPR'23 highlight). (Python, 30 stars)
  31. coyo-vit — ViT trained on COYO-Labeled-300M dataset. (Python, 28 stars)
  32. irm-empirical-study — An Empirical Study of Invariant Risk Minimization. (Python, 28 stars)
  33. coyo-align — ALIGN trained on COYO-dataset. (Python, 25 stars)
  34. magvlt — The official implementation of MAGVLT: Masked Generative Vision-and-Language Transformer (CVPR'23). (Python, 23 stars)
  35. hqtransformer — Locally Hierarchical Auto-Regressive Modeling for Image Generation (HQ-Transformer). (Jupyter Notebook, 21 stars)
  36. CheXGPT — (Python, 18 stars)
  37. learning-loss-for-tta — "Learning Loss for Test-Time Augmentation (NeurIPS 2020)". (Python, 9 stars)
  38. stg — Official implementation of Selective Token Generation (COLING'22). (Jupyter Notebook, 8 stars)
  39. leco — Official implementation of LECO (NeurIPS'22). (Python, 6 stars)
  40. bc-hyperopt-example — brain cloud hyperopt example (mnist). (Python, 3 stars)