
Pretrained Language Models For Korean

  • ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ์–ธ์–ด ๋ชจ๋ธ๋“ค์ด ์„ธ๊ณ„ ๊ฐ์ง€์—์„œ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ์ง€๋งŒ ๋Œ€๋ถ€๋ถ„ ์˜์–ด๋งŒ์„ ๋‹ค๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์—ฐ๊ตฌ๋ฅผ ์‹œ์ž‘ํ•˜์‹œ๋Š”๋ฐ ๋„์›€์ด ๋˜๊ณ ์ž ํ•œ๊ตญ์–ด๋กœ ํ•™์Šต๋œ ์ตœ์‹  ์–ธ์–ด๋ชจ๋ธ๋“ค์„ ๊ณต๊ฐœํ•ฉ๋‹ˆ๋‹ค.
  • Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉ๊ฐ€๋Šฅํ•˜๋„๋ก ๋งŒ๋“ค์—ˆ์œผ๋ฉฐ encoder ๊ธฐ๋ฐ˜(BERT ๋“ฑ), decoder ๊ธฐ๋ฐ˜(GPT3), encoder-decoder(T5, BERTSHARED) ๋ชจ๋ธ์„ ๋ชจ๋‘ ์ œ๊ณตํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋‰ด์Šค์™€ ๊ฐ™์ด ์ž˜ ์ •์ œ๋œ ์–ธ์–ด ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์‹ค์ œ ์ธํ„ฐ๋„ท ์ƒ์—์„œ ์“ฐ์ด๋Š” ์‹ ์กฐ์–ด, ์ค„์ž„๋ง, ์˜ค์ž, ํƒˆ์ž๋ฅผ ์ž˜ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•˜๊ธฐ ์œ„ํ•ด, ๋Œ€๋ถ„๋ฅ˜ ์ฃผ์ œ๋ณ„ ํ…์ŠคํŠธ๋ฅผ ๋ณ„๋„๋กœ ์ˆ˜์ง‘ํ•˜์˜€์œผ๋ฉฐ ๋Œ€๋ถ€๋ถ„์˜ ๋ฐ์ดํ„ฐ๋Š” ๋ธ”๋กœ๊ทธ, ๋Œ“๊ธ€, ๋ฆฌ๋ทฐ์ž…๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ์˜ ์ƒ์—…์  ์‚ฌ์šฉ์˜ ๊ฒฝ์šฐ MOU๋ฅผ ํ†ตํ•ด ๋ฌด๋ฃŒ๋กœ ์‚ฌ์šฉํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. [email protected] ๋กœ ๋ฌธ์˜ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.
  • ์ž์—ฐ์–ด์ฒ˜๋ฆฌ๋ฅผ ์ฒ˜์Œ ์ ‘ํ•˜์‹œ๋Š” ๋ถ„๋“ค์„ ์œ„ํ•ด Youtube์— ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๊ธฐ์ดˆ ๊ฐ•์˜๋ฅผ ์˜ฌ๋ ค๋‘์—ˆ์Šต๋‹ˆ๋‹ค(์•ฝ 6์‹œ๊ฐ„ ๋ถ„๋Ÿ‰)

Recent update

  • 2021-01-30: Added the Bertshared model (a seq2seq model based on Bert)
  • 2021-01-26: Added an initial version of the GPT3 model
  • 2021-01-22: Added the Funnel-transformer model

Pretraining models

Model                         Hidden size  Layers  Max length  Batch size  Learning rate  Training steps
albert-kor-base               768          12      256         1024        5e-4           0.9M
bert-kor-base                 768          12      512         256         1e-4           1.9M
funnel-kor-base               768          6_6_6   512         128         8e-5           0.9M
electra-kor-base              768          12      512         256         2e-4           1.9M
gpt3-kor-small_based_on_gpt2  768          12      2048        4096        1e-2           10K
bertshared-kor-base           768/768      12/12   512/512     16          5e-5           20K
  • Unlike the original models, the tokenizer has been unified to wordpiece for all models. See Usage below for details.
  • The released ELECTRA model is the discriminator.
  • Whole-word masking was applied to the BERT model.
  • The FUNNEL-TRANSFORMER model uses the ELECTRA setup, and both the generator and the discriminator are included.
  • The exact GPT3 architecture has not been published, but it appears to be nearly identical to GPT2, with a longer input length for few-shot learning and a few changes for computational efficiency. We therefore trained our model on top of GPT2 with these changes reflected.
  • BERTSHARED is a seq2seq model whose encoder and decoder are initialized from bert-kor-base and then trained. By letting the encoder and decoder share parameters, seq2seq can be implemented with the memory footprint of a single bert model (reference). The released model was trained on a summarization task, as sketched below.
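
The shared encoder-decoder setup described in the last bullet can be reproduced with the public Transformers EncoderDecoderModel API. The snippet below is a minimal sketch of such a warm start; the tie_encoder_decoder flag and the special-token settings are assumptions based on the general API, not the exact training code used for bertshared-kor-base.

# Sketch: warm-start a parameter-shared encoder-decoder from bert-kor-base.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("kykim/bert-kor-base")

# tie_encoder_decoder=True shares encoder and decoder weights, so the seq2seq
# model is roughly the size of a single BERT checkpoint.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "kykim/bert-kor-base", "kykim/bert-kor-base", tie_encoder_decoder=True)

# Special-token ids needed before fine-tuning/generation (assumed convention).
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id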

Notebooks

Notebook                  Description                                                                                   Colab
GPT3 generation           Given Korean input text, the GPT3 model generates the rest of the sentence.                   Open In Colab
Bertshared summarization  Summarizes a document with the Bertshared model.                                              Open In Colab
Mask prediction           For each masked language model, shows the words most likely to fill the mask in a sentence.   Open In Colab
  • These notebooks are meant to show simple test results and basic usage; to reach the performance you want on your own data, further tuning is required.
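
As a quick local illustration of the mask-prediction notebook, the released checkpoints can also be queried with the Transformers fill-mask pipeline. This is a minimal sketch; the example sentence is illustrative only.

# Sketch: mask prediction with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="kykim/bert-kor-base")
for pred in fill_mask("ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด [MASK]๋Š” ์žฌ๋ฏธ์žˆ๋‹ค."):  # illustrative Korean sentence with one mask
    print(pred["token_str"], round(pred["score"], 3))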

Usage

  • The models can be used conveniently with both pytorch and tensorflow through the Transformers library.
# electra-base-kor
from transformers import ElectraTokenizerFast, ElectraModel, TFElectraModel
tokenizer_electra = ElectraTokenizerFast.from_pretrained("kykim/electra-kor-base")

model_electra_pt = ElectraModel.from_pretrained("kykim/electra-kor-base")    # pytorch
model_electra_tf = TFElectraModel.from_pretrained("kykim/electra-kor-base")  # tensorflow

# bert-base-kor
from transformers import BertTokenizerFast, BertModel
tokenizer_bert = BertTokenizerFast.from_pretrained("kykim/bert-kor-base")
model_bert = BertModel.from_pretrained("kykim/bert-kor-base")

# albert-base-kor
from transformers import BertTokenizerFast, AlbertModel
tokenizer_albert = BertTokenizerFast.from_pretrained("kykim/albert-kor-base")
model_albert = AlbertModel.from_pretrained("kykim/albert-kor-base")

# funnel-base-kor
from transformers import FunnelTokenizerFast, FunnelModel
tokenizer_funnel = FunnelTokenizerFast.from_pretrained("kykim/funnel-kor-base")
model_funnel = FunnelModel.from_pretrained("kykim/funnel-kor-base")

# gpt3-kor-small_based_on_gpt2
from transformers import BertTokenizerFast, GPT2LMHeadModel
tokenizer_gpt3 = BertTokenizerFast.from_pretrained("kykim/gpt3-kor-small_based_on_gpt2")
model_gpt3 = GPT2LMHeadModel.from_pretrained("kykim/gpt3-kor-small_based_on_gpt2")

# bertshared-kor-base
from transformers import BertTokenizerFast, EncoderDecoderModel
tokenizer_bertshared = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
model_bertshared = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")
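
Building on the loading snippets above, the decoder and encoder-decoder checkpoints are used through the standard generate API. The sketch below is illustrative only: the prompt, the sampling settings, and the special-token handling are assumptions rather than the exact code from the notebooks.

# Sketch: sentence continuation with the GPT3-style model.
input_ids = tokenizer_gpt3.encode("์˜ค๋Š˜ ๋‚ ์”จ๋Š”", add_special_tokens=False, return_tensors="pt")
output_ids = model_gpt3.generate(input_ids, max_length=64, do_sample=True, top_p=0.95)
print(tokenizer_gpt3.decode(output_ids[0], skip_special_tokens=True))

# Sketch: abstractive summarization with the bertshared seq2seq model.
document = "์š”์•ฝํ•  ํ•œ๊ตญ์–ด ๋ฌธ์„œ ..."  # placeholder for a long Korean document
doc_ids = tokenizer_bertshared.encode(document, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model_bertshared.generate(doc_ids, max_length=64, num_beams=5)
print(tokenizer_bertshared.decode(summary_ids[0], skip_special_tokens=True))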

Dataset

  • The data used for training is as follows.
  - 100 million reviews from major Korean e-commerce sites + 20 million blog-style web pages (75GB)
  - Modu Corpus (18GB)
  - Korean Wikipedia and Namuwiki (6GB)
  • After removing unnecessary or overly short sentences as well as duplicates, 70GB of text (about 12.7 billion tokens) out of the 100GB of data was ultimately used for training.
  • The data is split into categories such as cosmetics (8GB), food (6GB), electronics (13GB), and pets (2GB), and was also used to train domain-specific language models.

Vocab

Vocab len  lower_case  strip_accent
42000      True        False
  • Characters other than Korean, English, digits, and some special characters were judged to hinder training and were removed (e.g., Chinese characters, emoji).
  • We used the wordpiece model from Huggingface tokenizers to generate 40000 subwords.
  • On top of these, 2000 unused tokens were added for training; the unused tokens are reserved for domain-specific vocabulary.
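
For reference, a vocab with these settings can be built with the Huggingface tokenizers library. The snippet below is only a sketch under assumptions: the corpus path, min_frequency, and the way the 2000 unused tokens are reserved are placeholders, not the actual training configuration.

# Sketch: build a roughly 42000-piece wordpiece vocab (about 40000 subwords + 2000 unused tokens).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True, strip_accents=False)
tokenizer.train(
    files=["corpus.txt"],                     # placeholder corpus path
    vocab_size=42000,                         # total size including special/unused tokens
    min_frequency=2,                          # assumed cutoff
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    + [f"[unused{i}]" for i in range(2000)],  # reserved for domain-specific terms
)
tokenizer.save_model(".")                     # writes vocab.txt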

Fine-tuning

  • The fine-tuning code and the results for KoBert, HanBERT, and KoELECTRA-Base-v3 are taken from KoELECTRA. The remaining results come from fine-tuning we ran ourselves with batch size=32, learning rate=3e-5, and 5-15 epochs (a minimal fine-tuning sketch follows the results table below).
Model              NSMC (acc)  Naver NER (F1)  PAWS (acc)  KorNLI (acc)  KorSTS (spearman)  Question Pair (acc)  Korean-Hate-Speech (Dev) (F1)
KoBERT             89.59       87.92           81.25       79.62         81.59              94.85                66.21
HanBERT            90.06       87.70           82.95       80.32         82.73              94.72                68.32
kcbert-base        89.87       85.00           67.40       75.57         75.94              93.93                68.78
KoELECTRA-Base-v3  90.63       88.11           84.45       82.24         85.53              95.25                67.61
OURS
albert-kor-base    89.45       82.66           81.20       79.42         81.76              94.59                65.44
bert-kor-base      90.87       87.27           82.80       82.32         84.31              95.25                68.45
electra-kor-base   91.29       87.20           85.50       83.11         85.46              95.78                66.03
funnel-kor-base    91.36       88.02           83.90       -             84.52              95.51                68.18
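
A minimal fine-tuning sketch for one of the tasks above (NSMC sentiment classification) is shown below. It assumes the nsmc dataset is available through the datasets library and uses the hyperparameters stated above; it is a sketch of the general procedure, not the exact fine-tuning code behind the table.

# Sketch: fine-tune electra-kor-base on NSMC with the hyperparameters described above.
from datasets import load_dataset
from transformers import (ElectraTokenizerFast, ElectraForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = ElectraTokenizerFast.from_pretrained("kykim/electra-kor-base")
model = ElectraForSequenceClassification.from_pretrained("kykim/electra-kor-base", num_labels=2)

dataset = load_dataset("nsmc")  # assumed: Naver sentiment movie corpus on the Hugging Face Hub

def tokenize(batch):
    return tokenizer(batch["document"], truncation=True, max_length=128, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="electra-kor-base-nsmc",
    per_device_train_batch_size=32,  # batch size 32
    learning_rate=3e-5,              # learning rate 3e-5
    num_train_epochs=5,              # 5-15 epochs were used
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()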

Citation

@misc{kim2020lmkor,
  author = {Kiyoung Kim},
  title = {Pretrained Language Models For Korean},
  year = {2020},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/kiyoungkim1/LMkor}}
}

Reference

Acknowledgments

License