
DistilKoBERT

Distillation of KoBERT (a lightweight version of SKTBrain's KoBERT)

January 27th, 2020 - Update: Retrained the model from scratch on a 10GB corpus. Performance on the subtasks improved slightly.
May 14th, 2020 - Update: Fixed the padding_idx issue of the original Transformers integration. See KoBERT-Transformers for details.

Pretraining DistilKoBERT

  • The original 12 layers were reduced to 3 layers; all other configuration follows KoBERT as-is.
  • For layer initialization, the weights of the 1st, 5th, and 9th layers of the original KoBERT were reused directly (see the sketch after this list).
  • The pretraining corpus consists of about 10GB of data (Korean Wikipedia, Namuwiki, news, etc.), trained for 3 epochs.
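
A minimal sketch of this initialization scheme, assuming a plain 3-layer BertModel student for simplicity (the released DistilKoBERT uses the DistilBERT architecture, and this is not the author's actual training script):

from transformers import BertConfig, BertModel

# Load the 12-layer teacher and build an otherwise-identical 3-layer student.
teacher = BertModel.from_pretrained('monologg/kobert')
student_config = BertConfig.from_pretrained('monologg/kobert', num_hidden_layers=3)
student = BertModel(student_config)

# Reuse the teacher's embeddings, then copy teacher layers 0, 4, 8
# (the 1st, 5th, and 9th layers) into student layers 0, 1, 2.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for student_idx, teacher_idx in enumerate([0, 4, 8]):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )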

KoBERT / DistilKoBERT for transformers library

  • The original KoBERT has been adapted so that it can be used directly with the transformers library.
    • Since transformers v2.2.2, user-created models can be uploaded and downloaded directly through transformers.
    • DistilKoBERT can likewise be downloaded and used directly from the transformers library.

Dependencies

  • torch==1.1.0
  • transformers==2.9.1
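
The pinned versions can be installed with, for example:

pip install torch==1.1.0 transformers==2.9.1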

How to Use

>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
  • To use the tokenizer, copy the tokenization_kobert.py file from the root directory and import KoBertTokenizer, as shown below.
    • KoBERT and DistilKoBERT both use the same tokenizer.
    • The original KoBERT had an issue where special tokens were not split properly; this has been fixed and reflected here. (Issue link)
>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert') # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
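
As an end-to-end sketch (assuming tokenization_kobert.py has been copied into the working directory as described above), the tokenizer output can be fed straight into the model:

import torch
from tokenization_kobert import KoBertTokenizer
from transformers import DistilBertModel

tokenizer = KoBertTokenizer.from_pretrained('monologg/distilkobert')
model = DistilBertModel.from_pretrained('monologg/distilkobert')

# encode() adds [CLS]/[SEP] and converts tokens to ids in one step.
input_ids = torch.tensor([tokenizer.encode("한국어 모델을 공유합니다.", add_special_tokens=True)])
sequence_output = model(input_ids)[0]  # shape: (batch_size, seq_len, hidden_size)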

Differences between BERT and DistilBERT

  • Unlike the original BERT, DistilBERT does not use token-type embeddings.

    • When using DistilBertModel from the Transformers library, you therefore do not need to pass token_type_ids, unlike with BertModel.
  • DistilBERT also does not use a pooler.

    • Hence BertModel's forward returns sequence_output, pooled_output, (hidden_states), (attentions), whereas DistilBertModel returns only sequence_output, (hidden_states), (attentions).
    • To extract the [CLS] token from DistilBERT, apply sequence_output[0][:, 0], as in the sketch below.
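
A small sketch of the difference, assuming the pinned transformers==2.9.1 (which returns plain tuples rather than output objects):

import torch
from transformers import BertModel, DistilBertModel

input_ids = torch.tensor([[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]])

bert_model = BertModel.from_pretrained('monologg/kobert')
sequence_output, pooled_output = bert_model(input_ids)[:2]  # BERT also returns a pooled [CLS] vector

distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')
outputs = distilbert_model(input_ids)  # no token_type_ids, no pooler
cls_vector = outputs[0][:, 0]          # take the [CLS] position manually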

Results on Subtasks

                        KoBERT        DistilKoBERT   Bert-multilingual
Model Size (MB)         351           108            681
NSMC (acc)              89.63         88.41          87.07
Naver NER (F1)          86.11         84.13          84.20
KorQuAD (Dev) (EM/F1)   52.81/80.27   54.12/77.80    77.04/87.85

Citation

If you use this code for research, please cite it as follows:

@misc{park2019distilkobert,
  author = {Park, Jangwon},
  title = {DistilKoBERT: Distillation of KoBERT},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/DistilKoBERT}}
}

More Repositories

  1. JointBERT: Pytorch implementation of JointBERT: "BERT for Joint Intent Classification and Slot Filling" (Python, 652 stars)
  2. KoELECTRA: Pretrained ELECTRA Model for Korean (Python, 598 stars)
  3. R-BERT: Pytorch implementation of R-BERT: "Enriching Pre-trained Language Model with Entity Information for Relation Classification" (Python, 347 stars)
  4. KoBigBird: 🦅 Pretrained BigBird Model for Korean (up to 4096 tokens) (Python, 202 stars)
  5. KoBERT-Transformers: KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed) (Python, 201 stars)
  6. GoEmotions-pytorch: Pytorch Implementation of GoEmotions 😍😢😱 (Python, 150 stars)
  7. KoBERT-NER: NER Task with KoBERT (with Naver NLP Challenge dataset) (Python, 94 stars)
  8. HanBert-Transformers: HanBert on 🤗 Huggingface Transformers 🤗 (Python, 86 stars)
  9. nlp-arxiv-daily: Automatically Update NLP Papers Daily using Github Actions (ref: https://github.com/Vincentqyw/cv-arxiv-daily) (Python, 79 stars)
  10. transformers-android-demo: 📲 Transformers android examples (Tensorflow Lite & Pytorch Mobile) (Java, 78 stars)
  11. KoBERT-nsmc: Naver movie review sentiment classification with KoBERT (Python, 76 stars)
  12. KoBERT-KorQuAD: Korean MRC (KorQuAD) with KoBERT (Python, 64 stars)
  13. EncT5: Pytorch Implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks (Python, 62 stars)
  14. NER-Multimodal-pytorch: Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018) (Python, 57 stars)
  15. KoCharELECTRA: Character-level Korean ELECTRA Model (syllable-level Korean ELECTRA) (Python, 53 stars)
  16. GoEmotions-Korean: Korean version of GoEmotions Dataset 😍😢😱 (Python, 52 stars)
  17. hashtag-prediction-pytorch: Multimodal Hashtag Prediction with instagram data & pytorch (2nd Place on OpenResource Hackathon 2019) (Python, 47 stars)
  18. KoELECTRA-Pipeline: Transformers Pipeline with KoELECTRA (Python, 40 stars)
  19. ko_lm_dataformat: A utility for storing and reading files for Korean LM training 💾 (Python, 36 stars)
  20. korean-ner-pytorch: NER Task with CNN + BiLSTM + CRF (with Naver NLP Challenge dataset) with Pytorch (Python, 30 stars)
  21. korean-hate-speech-koelectra: Bias, Hate classification with KoELECTRA 👿 (Python, 26 stars)
  22. python-template: Python template code (Makefile, 21 stars)
  23. naver-nlp-challenge-2018: NER task for Naver NLP Challenge 2018 (3rd Place) (Python, 18 stars)
  24. BIO-R-BERT: R-BERT on DDI Bio dataset with BioBERT (Python, 17 stars)
  25. kakaotrans: [Unofficial] Kakaotrans: Kakao translate API for python (Python, 16 stars)
  26. HanBert-NER: NER Task with HanBert (with Naver NLP Challenge dataset) (Python, 16 stars)
  27. py-backtrans: Python library for backtranslation (with Google Translate) (Python, 12 stars)
  28. dotfiles: Simple setup for personal dotfiles (Shell, 11 stars)
  29. monologg: Profile repository (9 stars)
  30. ner-sample: NER Sample Code (Python, 7 stars)
  31. HanBert-nsmc: Naver movie review sentiment classification with HanBert (Python, 4 stars)
  32. torchserve-practice (Python, 4 stars)
  33. monologg.github.io: Personal Blog https://monologg.github.io (CSS, 3 stars)