  • Stars: 584
  • Rank: 73,590 (Top 2%)
  • Language: Python
  • License: Apache License 2.0
  • Created: about 4 years ago
  • Updated: 2 months ago


Repository Details

Pretrained ELECTRA Model for Korean


KoELECTRA

ELECTRA is trained with Replaced Token Detection: the discriminator looks at tokens produced by the generator and decides whether each one is a "real" token or a "fake" (replaced) token. This approach has the advantage of learning from every input token, and it has shown better performance than BERT and comparable models.
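
For intuition, here is a minimal sketch of replaced token detection at inference time, assuming a recent transformers version: the ElectraForPreTraining head returns one real/fake score per input token (the sentence is only an illustration).

import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

# Discriminator with the replaced-token-detection head on top.
discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # shape (1, seq_len): one score per token

# A positive score means the discriminator judges the token to be "fake" (replaced).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, (logits[0] > 0).long().tolist())))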

KoELECTRA was trained on 34GB of Korean text, and the two resulting models, KoELECTRA-Base and KoELECTRA-Small, have been released.

KoELECTRA also uses Wordpiece and the models are uploaded to S3, so it can be used right away on any OS with only the Transformers library installed.

Download Link

Model              | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Base-v1  | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Small-v1 | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Base-v2  | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Small-v2 | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Base-v3  | Discriminator | Generator | Tensorflow-v1
KoELECTRA-Small-v3 | Discriminator | Generator | Tensorflow-v1
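
The PyTorch weights can also be fetched programmatically from the Hugging Face Hub instead of through the links above; a minimal sketch using the huggingface_hub package (installed separately from Transformers):

from huggingface_hub import snapshot_download

# Downloads every file of the model repo into the local cache and returns the directory path.
local_dir = snapshot_download("monologg/koelectra-base-v3-discriminator")
print(local_dir)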

About KoELECTRA

Model           | Component     | Layers | Embedding Size | Hidden Size | # heads
KoELECTRA-Base  | Discriminator | 12     | 768            | 768         | 12
KoELECTRA-Base  | Generator     | 12     | 768            | 256         | 4
KoELECTRA-Small | Discriminator | 12     | 128            | 256         | 4
KoELECTRA-Small | Generator     | 12     | 128            | 256         | 4
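
The sizes in the table can be checked against the published configs; a small sketch for the Base discriminator (attribute names follow the current ElectraConfig and may differ across transformers versions):

from transformers import ElectraConfig

config = ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator")
# Expected for the KoELECTRA-Base discriminator: 12 layers, 768 embedding, 768 hidden, 12 heads
print(config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)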

Vocabulary

  • The main goal of this project was to make the models usable right away with nothing but the Transformers library, so instead of Sentencepiece or Mecab we used the Wordpiece tokenizer from the original paper and code.
  • For details, see [Wordpiece Vocabulary].
Vocab | Vocab Len | do_lower_case
v1    | 32200     | False
v2    | 32200     | False
v3    | 35000     | False
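
The vocabulary sizes above can be verified directly from the released tokenizers; a small sketch for v3:

from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
print(tokenizer.vocab_size)                     # 35000 for v3
print(tokenizer.basic_tokenizer.do_lower_case)  # False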

Data

  • v1, v2์˜ ๊ฒฝ์šฐ ์•ฝ 14G Corpus (2.6B tokens)๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. (๋‰ด์Šค, ์œ„ํ‚ค, ๋‚˜๋ฌด์œ„ํ‚ค)
  • v3์˜ ๊ฒฝ์šฐ ์•ฝ 20G์˜ ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. (์‹ ๋ฌธ, ๋ฌธ์–ด, ๊ตฌ์–ด, ๋ฉ”์‹ ์ €, ์›น)

Pretraining Details

Model      | Batch Size | Train Steps | LR   | Max Seq Len | Generator Size | Train Time
Base v1,2  | 256        | 700K        | 2e-4 | 512         | 0.33           | 7d
Base v3    | 256        | 1.5M        | 2e-4 | 512         | 0.33           | 14d
Small v1,2 | 512        | 300K        | 5e-4 | 512         | 1.0            | 3d
Small v3   | 512        | 800K        | 5e-4 | 512         | 1.0            | 7d
  • The KoELECTRA-Small models use the same options as ELECTRA-Small++ in the original paper.

    • This is the same configuration as the Small model released by the official ELECTRA repository.
    • Also, unlike KoELECTRA-Base, the Generator and the Discriminator have the same model size (generator_hidden_size = 1.0, as in the table above).
  • Apart from batch size and train steps, the hyperparameters were kept identical to the original paper (a sketch of the resulting configuration follows this list).

    • We also tried other hyperparameter settings, but keeping them identical to the original paper gave the best performance.
  • Training was done on a TPU v3-8; how to use TPUs on GCP is documented in [Using TPU for Pretraining].
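
As a concrete illustration of the table above, this is roughly what the hparams override for the official google-research/electra pretraining script would look like for Base v3. The key names follow that repository's configure_pretraining.py, not this repo, so treat them as an assumption and verify against the version you run.

# Assumed hparams override for run_pretraining.py --hparams (google-research/electra);
# the values mirror the Pretraining Details table, the key names may differ between versions.
koelectra_base_v3_hparams = {
    "model_size": "base",
    "train_batch_size": 256,
    "num_train_steps": 1500000,
    "learning_rate": 2e-4,
    "max_seq_length": 512,
    "generator_hidden_size": 0.33,  # generator size as a fraction of the discriminator
    "vocab_size": 35000,
}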

KoELECTRA on 🤗 Transformers 🤗

  • ElectraModel has been officially supported in Transformers since v2.8.0.

  • The models are already uploaded to Huggingface S3, so they can be used right away without downloading anything manually.

  • ElectraModel is similar to BertModel except that it does not return pooled_output.

  • ELECTRA uses the discriminator for finetuning.

1. Pytorch Model & Tokenizer

from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-discriminator")  # KoELECTRA-Base
model = ElectraModel.from_pretrained("monologg/koelectra-small-discriminator")  # KoELECTRA-Small
model = ElectraModel.from_pretrained("monologg/koelectra-base-v2-discriminator")  # KoELECTRA-Base-v2
model = ElectraModel.from_pretrained("monologg/koelectra-small-v2-discriminator")  # KoELECTRA-Small-v2
model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")  # KoELECTRA-Base-v3
model = ElectraModel.from_pretrained("monologg/koelectra-small-v3-discriminator")  # KoELECTRA-Small-v3
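
Since ElectraModel returns no pooled_output (see the note above), a common pattern is to pool manually, for example by taking the [CLS] hidden state; a minimal sketch assuming a recent transformers version:

import torch
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# No pooled_output: take the hidden state of the first ([CLS]) token instead.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # (1, 768) for the Base model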

2. Tensorflow v2 Model

from transformers import TFElectraModel

# from_pt=True converts the PyTorch checkpoint on the fly; no separate TF2 checkpoint is needed.
model = TFElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator", from_pt=True)

3. Tokenizer Example

>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
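
Calling the tokenizer directly adds [CLS] and [SEP] automatically, so the same ids as in the explicit example above should come back without writing the special tokens by hand:

>>> tokenizer("한국어 ELECTRA를 공유합니다.")["input_ids"]
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]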

Result on Subtask

config์˜ ์„ธํŒ…์„ ๊ทธ๋Œ€๋กœ ํ•˜์—ฌ ๋Œ๋ฆฐ ๊ฒฐ๊ณผ์ด๋ฉฐ, hyperparameter tuning์„ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•  ์‹œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฝ”๋“œ ๋ฐ ์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetuning] ์ฐธ๊ณ 

Base Model

Model             | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1)
KoBERT            | 89.59      | 87.92          | 81.25      | 79.62        | 81.59             | 94.85               | 51.75 / 79.15         | 66.21
XLM-Roberta-Base  | 89.03      | 86.65          | 82.80      | 80.23        | 78.45             | 93.80               | 64.70 / 88.94         | 64.06
HanBERT           | 90.06      | 87.70          | 82.95      | 80.32        | 82.73             | 94.72               | 78.74 / 92.02         | 68.32
KoELECTRA-Base    | 90.33      | 87.18          | 81.70      | 80.64        | 82.00             | 93.54               | 60.86 / 89.28         | 66.09
KoELECTRA-Base-v2 | 89.56      | 87.16          | 80.70      | 80.72        | 82.30             | 94.85               | 84.01 / 92.40         | 67.45
KoELECTRA-Base-v3 | 90.63      | 88.11          | 84.45      | 82.24        | 85.53             | 95.25               | 84.83 / 93.45         | 67.61

Small Model

Model              | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1)
DistilKoBERT       | 88.60      | 84.65          | 60.50      | 72.00        | 72.59             | 92.48               | 54.40 / 77.97         | 60.72
KoELECTRA-Small    | 88.83      | 84.38          | 73.10      | 76.45        | 76.56             | 93.01               | 58.04 / 86.76         | 63.03
KoELECTRA-Small-v2 | 88.83      | 85.00          | 72.35      | 78.14        | 77.84             | 93.27               | 81.43 / 90.46         | 60.14
KoELECTRA-Small-v3 | 89.36      | 85.40          | 77.45      | 78.60        | 80.79             | 94.85               | 82.11 / 91.13         | 63.07

Updates

April 27, 2020

  • 2๊ฐœ์˜ Subtask (KorSTS, QuestionPair)์— ๋Œ€ํ•ด ์ถ”๊ฐ€์ ์œผ๋กœ finetuning์„ ์ง„ํ–‰ํ•˜์˜€๊ณ , ๊ธฐ์กด 5๊ฐœ์˜ Subtask์— ๋Œ€ํ•ด์„œ๋„ ๊ฒฐ๊ณผ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜์˜€์Šต๋‹ˆ๋‹ค.

June 3, 2020

  • KoELECTRA-v2 was built using the vocabulary from the EnlipleAI PLM. Both the Base and Small models show improved performance on KorQuaD.

October 9, 2020

  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ KoELECTRA-v3๋ฅผ ์ œ์ž‘ํ•˜์˜€์Šต๋‹ˆ๋‹ค. Vocab๋„ Mecab๊ณผ Wordpiece๋ฅผ ์ด์šฉํ•˜์—ฌ ์ƒˆ๋กœ ์ œ์ž‘ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
  • Huggingface Transformers์˜ ElectraForSequenceClassification ๊ณต์‹ ์ง€์› ๋“ฑ์„ ๊ณ ๋ คํ•˜์—ฌ ๊ธฐ์กด Subtask ๊ฒฐ๊ณผ๋ฅผ ์ƒˆ๋กœ Updateํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ Korean-Hate-Speech์˜ ๊ฒฐ๊ณผ๋„ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

May 26, 2021

  • Fixed an issue where the models failed to load on torch<=1.4 (models modified and re-uploaded) (Related Issue)
  • Uploaded Tensorflow v2 models (tf_model.h5) to the huggingface hub

Oct 20, 2021

  • Removed direct loading from tf_model.h5 because of several issues (reverted to loading with from_pt=True)

Acknowledgement

KoELECTRA์€ Tensorflow Research Cloud (TFRC) ํ”„๋กœ๊ทธ๋žจ์˜ Cloud TPU ์ง€์›์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ KoELECTRA-v3๋Š” ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜์˜ ๋„์›€์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Citation

If you use this code for research, please cite it as follows.

@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}

More Repositories

1. JointBERT (Python, 600 stars): Pytorch implementation of JointBERT: "BERT for Joint Intent Classification and Slot Filling"
2. R-BERT (Python, 333 stars): Pytorch implementation of R-BERT: "Enriching Pre-trained Language Model with Entity Information for Relation Classification"
3. KoBigBird (Python, 201 stars): 🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)
4. KoBERT-Transformers (Python, 190 stars): KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed)
5. DistilKoBERT (Python, 180 stars): Distillation of KoBERT from SKTBrain (Lightweight KoBERT)
6. GoEmotions-pytorch (Python, 142 stars): Pytorch Implementation of GoEmotions 😍😢😱
7. KoBERT-NER (Python, 90 stars): NER Task with KoBERT (with Naver NLP Challenge dataset)
8. HanBert-Transformers (Python, 85 stars): HanBert on 🤗 Huggingface Transformers 🤗
9. KoBERT-nsmc (Python, 76 stars): Naver movie review sentiment classification with KoBERT
10. transformers-android-demo (Java, 76 stars): 📲 Transformers android examples (Tensorflow Lite & Pytorch Mobile)
11. KoBERT-KorQuAD (Python, 66 stars): Korean MRC (KorQuAD) with KoBERT
12. nlp-arxiv-daily (Python, 63 stars): Automatically Update NLP Papers Daily using Github Actions (ref: https://github.com/Vincentqyw/cv-arxiv-daily)
13. EncT5 (Python, 58 stars): Pytorch Implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks
14. NER-Multimodal-pytorch (Python, 56 stars): Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018)
15. KoCharELECTRA (Python, 53 stars): Character-level (syllable-level) Korean ELECTRA Model
16. GoEmotions-Korean (Python, 50 stars): Korean version of GoEmotions Dataset 😍😢😱
17. hashtag-prediction-pytorch (Python, 47 stars): Multimodal Hashtag Prediction with instagram data & pytorch (2nd Place on OpenResource Hackathon 2019)
18. KoELECTRA-Pipeline (Python, 40 stars): Transformers Pipeline with KoELECTRA
19. ko_lm_dataformat (Python, 36 stars): A utility for storing and reading files for Korean LM training 💾
20. korean-ner-pytorch (Python, 27 stars): NER Task with CNN + BiLSTM + CRF (with Naver NLP Challenge dataset) with Pytorch
21. korean-hate-speech-koelectra (Python, 26 stars): Bias, Hate classification with KoELECTRA 👿
22. python-template (Makefile, 21 stars): Python template code
23. naver-nlp-challenge-2018 (Python, 19 stars): NER task for Naver NLP Challenge 2018 (3rd Place)
24. BIO-R-BERT (Python, 17 stars): R-BERT on DDI Bio dataset with BioBERT
25. HanBert-NER (Python, 16 stars): NER Task with HanBert (with Naver NLP Challenge dataset)
26. kakaotrans (Python, 15 stars): [Unofficial] Kakaotrans: Kakao translate API for python
27. py-backtrans (Python, 12 stars): Python library for backtranslation (with Google Translate)
28. dotfiles (Shell, 10 stars): Simple setup for personal dotfiles
29. monologg (9 stars): Profile repository
30. kobert2transformers (Python, 7 stars): KoBERT to transformers library format
31. ner-sample (Python, 7 stars): NER Sample Code
32. HanBert-nsmc (Python, 4 stars): Naver movie review sentiment classification with HanBert
33. torchserve-practice (Python, 4 stars)
34. monologg.github.io (CSS, 3 stars): Personal Blog https://monologg.github.io