  • Stars: 180
  • Rank: 205,512 (Top 5%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 4 years ago
  • Updated: 8 months ago

Repository Details

Distillation of KoBERT from SKTBrain (Lightweight KoBERT)

DistilKoBERT

Distillation of KoBERT (a lightweight version of SKTBrain's KoBERT)

January 27th, 2020 - Update: Retrained the model from scratch on a 10GB corpus. Performance on the subtasks improved slightly.
May 14th, 2020 - Update: Fixed the padding_idx issue in the existing Transformers integration. See KoBERT-Transformers for details.

Pretraining DistilKoBERT

  • ๊ธฐ์กด์˜ 12 layer๋ฅผ 3 layer๋กœ ์ค„์˜€์œผ๋ฉฐ, ๊ธฐํƒ€ configuration์€ kobert๋ฅผ ๊ทธ๋Œ€๋กœ ๋”ฐ๋ž์Šต๋‹ˆ๋‹ค.
  • Layer ์ดˆ๊ธฐํ™”์˜ ๊ฒฝ์šฐ ๊ธฐ์กด KoBERT์˜ 1, 5, 9๋ฒˆ์งธ layer ๊ฐ’์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.
  • Pretraining Corpus๋Š” ํ•œ๊ตญ์–ด ์œ„ํ‚ค, ๋‚˜๋ฌด์œ„ํ‚ค, ๋‰ด์Šค ๋“ฑ ์•ฝ 10GB์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, 3 epoch ํ•™์Šตํ•˜์˜€์Šต๋‹ˆ๋‹ค.
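
Below is a minimal sketch of that layer-selection initialization, purely for illustration. It builds a 3-layer BERT-shaped student (the released DistilKoBERT actually uses the DistilBERT architecture, whose parameter names differ, so mapping into DistilBertModel would additionally require renaming), and it assumes 0-indexed encoder layers, so the 1st/5th/9th layers map to indices 0, 4, and 8.

import copy

from transformers import BertConfig, BertModel

# Teacher: the original 12-layer KoBERT.
teacher = BertModel.from_pretrained('monologg/kobert')

# Student: same configuration as KoBERT, but with only 3 hidden layers.
student_config = BertConfig.from_pretrained('monologg/kobert', num_hidden_layers=3)
student = BertModel(student_config)

# Reuse the teacher's embeddings, and initialize the 3 student layers
# from the teacher's 1st, 5th, and 9th layers (0-indexed: 0, 4, 8).
student.embeddings = copy.deepcopy(teacher.embeddings)
for student_idx, teacher_idx in enumerate([0, 4, 8]):
    student.encoder.layer[student_idx] = copy.deepcopy(teacher.encoder.layer[teacher_idx])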

KoBERT / DistilKoBERT for transformers library

  • ๊ธฐ์กด์˜ KoBERT๋ฅผ transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ๊ณง๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งž์ท„์Šต๋‹ˆ๋‹ค.
    • transformers v2.2.2๋ถ€ํ„ฐ ๊ฐœ์ธ์ด ๋งŒ๋“  ๋ชจ๋ธ์„ transformers๋ฅผ ํ†ตํ•ด ์ง์ ‘ ์—…๋กœ๋“œ/๋‹ค์šด๋กœ๋“œํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
    • DistilKoBERT ์—ญ์‹œ transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ๊ณง๋ฐ”๋กœ ๋‹ค์šด ๋ฐ›์•„์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Dependencies

  • torch==1.1.0
  • transformers==2.9.1

How to Use

>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')

  • To use the tokenizer, copy the tokenization_kobert.py file from the root directory and import KoBertTokenizer.
    • KoBERT and DistilKoBERT both use the same tokenizer.
    • The original KoBERT had an issue where special tokens were not split off correctly; this has been fixed and incorporated here. (Issue link)
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]

What is different between BERT and DistilBERT

  • Unlike the original BERT, DistilBERT does not use token-type embeddings.

    • When using DistilBertModel from the Transformers library, there is no need to pass token_type_ids, unlike with the original BertModel.
  • DistilBERT also does not use a pooler.

    • Consequently, while the original BertModel's forward returns sequence_output, pooled_output, (hidden_states), (attentions), DistilBertModel returns only sequence_output, (hidden_states), (attentions).
    • To extract the [CLS] token from DistilBERT, apply sequence_output[0][:, 0], as in the sketch after this list.
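
A minimal sketch of these return-value differences, assuming the pinned transformers==2.9.1 (where forward returns plain tuples) and reusing the token ids from the tokenizer example above:

import torch
from transformers import BertModel, DistilBertModel

bert_model = BertModel.from_pretrained('monologg/kobert')
distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')

# "[CLS] 한국어 모델을 공유합니다. [SEP]" as ids (see the tokenizer example).
input_ids = torch.LongTensor([[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]])

# BertModel returns (sequence_output, pooled_output, ...); token_type_ids is optional.
sequence_output, pooled_output = bert_model(input_ids)[:2]

# DistilBertModel takes no token_type_ids and returns no pooled output.
outputs = distilbert_model(input_ids)
cls_vector = outputs[0][:, 0]  # hidden state of the [CLS] token, shape (1, 768)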

Results on Subtasks

                       KoBERT        DistilKoBERT   Bert-multilingual
Model Size (MB)        351           108            681
NSMC (acc)             89.63         88.41          87.07
Naver NER (F1)         86.11         84.13          84.20
KorQuAD (Dev) (EM/F1)  52.81/80.27   54.12/77.80    77.04/87.85

Citation

If you use this code for research, please cite it as follows.

@misc{park2019distilkobert,
  author = {Park, Jangwon},
  title = {DistilKoBERT: Distillation of KoBERT},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/DistilKoBERT}}
}

More Repositories

 1. JointBERT - Pytorch implementation of JointBERT: "BERT for Joint Intent Classification and Slot Filling" (Python, 600 stars)
 2. KoELECTRA - Pretrained ELECTRA Model for Korean (Python, 584 stars)
 3. R-BERT - Pytorch implementation of R-BERT: "Enriching Pre-trained Language Model with Entity Information for Relation Classification" (Python, 333 stars)
 4. KoBigBird - 🦅 Pretrained BigBird Model for Korean (up to 4096 tokens) (Python, 201 stars)
 5. KoBERT-Transformers - KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed) (Python, 190 stars)
 6. GoEmotions-pytorch - Pytorch Implementation of GoEmotions 😍😢😱 (Python, 142 stars)
 7. KoBERT-NER - NER Task with KoBERT (with Naver NLP Challenge dataset) (Python, 90 stars)
 8. HanBert-Transformers - HanBert on 🤗 Huggingface Transformers 🤗 (Python, 85 stars)
 9. KoBERT-nsmc - Naver movie review sentiment classification with KoBERT (Python, 76 stars)
10. transformers-android-demo - 📲 Transformers android examples (Tensorflow Lite & Pytorch Mobile) (Java, 76 stars)
11. KoBERT-KorQuAD - Korean MRC (KorQuAD) with KoBERT (Python, 66 stars)
12. nlp-arxiv-daily - Automatically Update NLP Papers Daily using Github Actions (ref: https://github.com/Vincentqyw/cv-arxiv-daily) (Python, 63 stars)
13. EncT5 - Pytorch Implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks (Python, 58 stars)
14. NER-Multimodal-pytorch - Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018) (Python, 56 stars)
15. KoCharELECTRA - Character-level Korean ELECTRA Model (syllable-level Korean ELECTRA) (Python, 53 stars)
16. GoEmotions-Korean - Korean version of GoEmotions Dataset 😍😢😱 (Python, 50 stars)
17. hashtag-prediction-pytorch - Multimodal Hashtag Prediction with instagram data & pytorch (2nd Place on OpenResource Hackathon 2019) (Python, 47 stars)
18. KoELECTRA-Pipeline - Transformers Pipeline with KoELECTRA (Python, 40 stars)
19. ko_lm_dataformat - A utility for storing and reading files for Korean LM training 💾 (Python, 36 stars)
20. korean-ner-pytorch - NER Task with CNN + BiLSTM + CRF (with Naver NLP Challenge dataset) with Pytorch (Python, 27 stars)
21. korean-hate-speech-koelectra - Bias, Hate classification with KoELECTRA 👿 (Python, 26 stars)
22. python-template - Python template code (Makefile, 21 stars)
23. naver-nlp-challenge-2018 - NER task for Naver NLP Challenge 2018 (3rd Place) (Python, 19 stars)
24. BIO-R-BERT - R-BERT on DDI Bio dataset with BioBERT (Python, 17 stars)
25. HanBert-NER - NER Task with HanBert (with Naver NLP Challenge dataset) (Python, 16 stars)
26. kakaotrans - [Unofficial] Kakaotrans: Kakao translate API for python (Python, 15 stars)
27. py-backtrans - Python library for backtranslation (with Google Translate) (Python, 12 stars)
28. dotfiles - Simple setup for personal dotfiles (Shell, 10 stars)
29. monologg - Profile repository (9 stars)
30. kobert2transformers - KoBERT to transformers library format (Python, 7 stars)
31. ner-sample - NER Sample Code (Python, 7 stars)
32. HanBert-nsmc - Naver movie review sentiment classification with HanBert (Python, 4 stars)
33. torchserve-practice (Python, 4 stars)
34. monologg.github.io - Personal Blog https://monologg.github.io (CSS, 3 stars)