KoRean based Bert pre-trained (KR-BERT)
This is a release of Korean-specific, small-scale BERT models with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University and described in KR-BERT: A Small-Scale Korean-Specific Language Model.
Vocab, Parameters and Data
| | Multilingual BERT (Google) | KorBERT (ETRI) | KoBERT (SKT) | KR-BERT character | KR-BERT sub-character |
|---|---|---|---|---|---|
| vocab size | 119,547 | 30,797 | 8,002 | 16,424 | 12,367 |
| parameter size | 167,356,416 | 109,973,391 | 92,186,880 | 99,265,066 | 96,145,233 |
| data size | - (The Wikipedia data for 104 languages) | 23GB, 4.7B morphemes | - (25M sentences, 233M words) | 2.47GB, 20M sentences, 233M words | 2.47GB, 20M sentences, 233M words |
Model | Masked LM Accuracy |
---|---|
KoBERT | 0.750 |
KR-BERT character BidirectionalWordPiece | 0.779 |
KR-BERT sub-character BidirectionalWordPiece | 0.769 |
Sub-character
Korean text is basically represented with Hangul syllable characters, which can be decomposed into sub-characters, or graphemes. To accommodate such characteristics, we trained a new vocabulary and BERT model on two different representations of a corpus: syllable characters and sub-characters.
To use our sub-character model, preprocess your data with the code below.
import torch
from transformers import BertConfig, BertModel, BertForPreTraining, BertTokenizer
from unicodedata import normalize

tokenizer_krbert = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)

# convert a string into sub-char
def to_subchar(string):
    return normalize('NFKD', string)

sentence = '토크나이저 예시입니다.'
print(tokenizer_krbert.tokenize(to_subchar(sentence)))
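To see what the sub-character preprocessing does without loading a tokenizer, the following minimal sketch prints the code points produced by NFKD normalization for a single syllable. The `from_subchar` helper is a hypothetical addition for illustration and is not part of this release; it assumes NFC recomposition is an acceptable way to map sub-characters back to syllables.

```python
from unicodedata import normalize


def to_subchar(string: str) -> str:
    # NFKD splits each precomposed Hangul syllable into its conjoining jamo.
    return normalize('NFKD', string)


def from_subchar(string: str) -> str:
    # Hypothetical helper (not part of the release):
    # NFC recomposes conjoining jamo into precomposed syllables.
    return normalize('NFC', string)


syllable = '한'
jamo = to_subchar(syllable)
print([hex(ord(ch)) for ch in jamo])   # ['0x1112', '0x1161', '0x11ab']
print(from_subchar(jamo) == syllable)  # True
```

One syllable becomes an initial consonant, a vowel, and (when present) a final consonant, so sub-character input is longer than the original text. Keep this in mind when choosing a maximum sequence length.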
Tokenization
BidirectionalWordPiece Tokenizer
We use the BidirectionalWordPiece model to reduce search costs while maintaining the possibility of choice. This model applies BPE in both forward and backward directions to obtain two candidates and chooses the one that has a higher frequency.
| | Multilingual BERT | KorBERT character | KoBERT | KR-BERT character WordPiece | KR-BERT character BidirectionalWordPiece | KR-BERT sub-character WordPiece | KR-BERT sub-character BidirectionalWordPiece |
|---|---|---|---|---|---|---|---|
| 냉장고 nayngcangko "refrigerator" | 냉#장#고 nayng#cang#ko | 냉#장#고 nayng#cang#ko | 냉#장#고 nayng#cang#ko | 냉장고 nayngcangko | 냉장고 nayngcangko | 냉장고 nayngcangko | 냉장고 nayngcangko |
| 춥다 chwupta "cold" | [UNK] | 춥#다 chwup#ta | 춥#다 chwup#ta | 춥#다 chwup#ta | 춥#다 chwup#ta | 추#ㅂ다 chwu#pta | 추#ㅂ다 chwu#pta |
| 뱃사람 paytsalam "seaman" | [UNK] | 뱃#사람 payt#salam | 뱃#사람 payt#salam | 뱃#사람 payt#salam | 뱃#사람 payt#salam | 배#ㅅ#사람 pay#t#salam | 배#ㅅ#사람 pay#t#salam |
| 마이크 maikhu "microphone" | 마#이#크 ma#i#khu | 마이#크 mai#khu | 마#이#크 ma#i#khu | 마이크 maikhu | 마이크 maikhu | 마이크 maikhu | 마이크 maikhu |
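The idea behind the BidirectionalWordPiece tokenizer (build one candidate from the left, one from the right, and keep the more frequent one) can be sketched as follows. This is a simplified illustration, not the released implementation: the toy vocabulary, the function names, and the scoring rule (sum of piece frequencies) are all assumptions, and the `##` continuation markers of real WordPiece vocabularies are omitted.

```python
from typing import Dict, List


def forward_match(word: str, vocab: Dict[str, int]) -> List[str]:
    # Greedy longest-match-first segmentation, scanning left to right.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches at this position
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens


def backward_match(word: str, vocab: Dict[str, int]) -> List[str]:
    # Greedy longest-match-first segmentation, scanning right to left.
    tokens, end = [], len(word)
    while end > 0:
        start = 0
        while start < end and word[start:end] not in vocab:
            start += 1
        if start == end:  # no piece matches at this position
            return ["[UNK]"]
        tokens.insert(0, word[start:end])
        end = start
    return tokens


def score(tokens: List[str], vocab: Dict[str, int]) -> int:
    # Assumed scoring rule: total frequency of the pieces (unknown pieces count 0).
    return sum(vocab.get(token, 0) for token in tokens)


def bidirectional_tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    forward = forward_match(word, vocab)
    backward = backward_match(word, vocab)
    # Keep the candidate with the higher score; ties go to the forward candidate.
    return forward if score(forward, vocab) >= score(backward, vocab) else backward


# Toy vocabulary mapping pieces to made-up frequencies.
toy_vocab = {"ab": 10, "c": 1, "a": 2, "bc": 50}
print(bidirectional_tokenize("abc", toy_vocab))  # ['a', 'bc']
```

Here the forward pass yields `ab` + `c` and the backward pass yields `a` + `bc`; the backward candidate wins because its pieces are more frequent in the toy vocabulary.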
Models
| | TensorFlow character | TensorFlow sub-character | PyTorch character | PyTorch sub-character |
|---|---|---|---|---|
| WordPiece tokenizer | WP char | WP subchar | WP char | WP subchar |
| Bidirectional WordPiece tokenizer | BiWP char | BiWP subchar | BiWP char | BiWP subchar |
Requirements
- transformers == 2.1.1
- tensorflow < 2.0
Downstream tasks
Naver Sentiment Movie Corpus (NSMC)
- To use the sub-character version of our models, set the `subchar` argument to `True`.
- To use the original BERT WordPiece tokenizer, pass `bert` as the `tokenizer` argument; to use our BidirectionalWordPiece tokenizer, pass `ranked`.
- tensorflow: After downloading our pretrained models, put them in a `models` directory inside the `krbert_tensorflow` directory.
- pytorch: After downloading our pretrained models, put them in a `pretrained` directory inside the `krbert_pytorch` directory.
# pytorch
python3 train.py --subchar {True, False} --tokenizer {bert, ranked}
# tensorflow
python3 run_classifier.py \
--task_name=NSMC \
--subchar={True, False} \
--tokenizer={bert, ranked} \
--do_train=true \
--do_eval=true \
--do_predict=true \
--do_lower_case=False \
--max_seq_length=128 \
--train_batch_size=128 \
--learning_rate=5e-05 \
--num_train_epochs=5.0 \
--output_dir={output_dir}
The PyTorch code structure follows that of https://github.com/aisolab/nlp_implementation.
NSMC Acc.
| | multilingual BERT | KorBERT | KoBERT | KR-BERT character WordPiece | KR-BERT character Bidirectional WordPiece | KR-BERT sub-character WordPiece | KR-BERT sub-character Bidirectional WordPiece |
|---|---|---|---|---|---|---|---|
| pytorch | - | 89.84 | 89.01 | 89.34 | 89.38 | 89.20 | 89.34 |
| tensorflow | 87.08 | 85.94 | n/a | 89.86 | 90.10 | 89.76 | 89.86 |
Citation
If you use these models, please cite the following paper:
@article{lee2020krbert,
title={KR-BERT: A Small-Scale Korean-Specific Language Model},
author={Sangah Lee and Hansol Jang and Yunmee Baik and Suzi Park and Hyopil Shin},
year={2020},
journal={ArXiv},
volume={abs/2008.03979}
}