
🤗 Korean Comments ELECTRA: An ELECTRA Model Trained on Korean Comments

KcELECTRA: Korean comments ELECTRA

** Updates on 2022.11.07 **

  • We are releasing the expanded text dataset (v2022.3Q) used to train KcELECTRA v2022.
  • https://github.com/Beomi/KcBERT/releases/tag/v2022.3Q
  • The dataset grows from 11GB to 45GB and from about 90M to about 340M entries, roughly a 4x increase over the v1 dataset.

** Updates on 2022.10.08 **

  • The KcELECTRA-base-v2022 model (formerly v2022-dev) has been renamed.
  • Detailed scores for this model have been added.
  • Compared to the previous KcELECTRA-base (v2021), it improves performance by about 1%p on most downstream tasks.

๊ณต๊ฐœ๋œ ํ•œ๊ตญ์–ด Transformer ๊ณ„์—ด ๋ชจ๋ธ๋“ค์€ ๋Œ€๋ถ€๋ถ„ ํ•œ๊ตญ์–ด ์œ„ํ‚ค, ๋‰ด์Šค ๊ธฐ์‚ฌ, ์ฑ… ๋“ฑ ์ž˜ ์ •์ œ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ํ•œํŽธ, ์‹ค์ œ๋กœ NSMC์™€ ๊ฐ™์€ User-Generated Noisy text domain ๋ฐ์ดํ„ฐ์…‹์€ ์ •์ œ๋˜์ง€ ์•Š์•˜๊ณ  ๊ตฌ์–ด์ฒด ํŠน์ง•์— ์‹ ์กฐ์–ด๊ฐ€ ๋งŽ์œผ๋ฉฐ, ์˜คํƒˆ์ž ๋“ฑ ๊ณต์‹์ ์ธ ๊ธ€์“ฐ๊ธฐ์—์„œ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š” ํ‘œํ˜„๋“ค์ด ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค.

To handle datasets with these characteristics, KcELECTRA is a pretrained ELECTRA model whose tokenizer and model were trained from scratch on comments and replies collected from online news articles.

Compared to the earlier KcBERT, the larger dataset and expanded vocab yield a substantial performance improvement.

KcELECTRA can be loaded directly through Huggingface's Transformers library. (No separate file download is required.)

💡 NOTE 💡
KoELECTRA, trained on a general corpus, is likely to perform better on general-purpose tasks.
KcBERT/KcELECTRA are PLMs that work better on user-generated, noisy text.

KcELECTRA Performance

  • The finetuning code is available at https://github.com/Beomi/KcBERT-finetune.
  • Detailed per-step scores are available in each checkpoint folder of that repo.
| Model | Size | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) |
|---|---|---|---|---|---|---|---|---|
| KcELECTRA-base-v2022 | 475M | 91.97 | 87.35 | 76.50 | 82.12 | 83.67 | 95.12 | 69.00 / 90.40 |
| KcELECTRA-base | 475M | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | 95.78 | 70.60 / 90.11 |
| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
| KcBERT-Large | 1.2G | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
| HanBERT | 614M | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 |
| KoELECTRA-Base | 423M | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | 83.90 | 80.61 | 84.30 | 94.72 | 84.34 / 92.58 |
| KoELECTRA-Base-v3 | 423M | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 |
| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |

*HanBERT's size is the combined size of the BERT model and the tokenizer DB.

*Results were obtained with the config settings unchanged; additional hyperparameter tuning may yield better scores.

How to use

Requirements

  • pytorch ~= 1.8.0
  • transformers ~= 4.11.3
  • emoji ~= 0.6.0
  • soynlp ~= 0.0.493

Default usage

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base-v2022")
model = AutoModel.from_pretrained("beomi/KcELECTRA-base-v2022")

# If you want to use the older version (v2021):
# tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
# model = AutoModel.from_pretrained("beomi/KcELECTRA-base")

๐Ÿ’ก ์ด์ „ KcBERT ๊ด€๋ จ ์ฝ”๋“œ๋“ค์—์„œ AutoTokenizer, AutoModel ์„ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ .from_pretrained("beomi/kcbert-base") ๋ถ€๋ถ„์„ .from_pretrained("beomi/KcELECTRA-base") ๋กœ๋งŒ ๋ณ€๊ฒฝํ•ด์ฃผ์‹œ๋ฉด ์ฆ‰์‹œ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

Pretrain & Finetune Colab Links

Pretrain Data

  • Data used for KcBERT training, plus comments collected through early March 2021
    • About 17GB
    • Documents are built by grouping each comment with its replies

Pretrain Code

Finetune Code

Finetune Samples

  • NSMC with PyTorch-Lightning 1.3.0, GPU, Colab

Train Data & Preprocessing

Raw Data

The training data consists of all comments and replies collected from heavily-commented news articles (or, in some periods, from all news articles) written between 2019.01.01 and 2021.03.09.

๋ฐ์ดํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ํ…์ŠคํŠธ๋งŒ ์ถ”์ถœ์‹œ ์•ฝ 17.3GB์ด๋ฉฐ, 1์–ต8์ฒœ๋งŒ๊ฐœ ์ด์ƒ์˜ ๋ฌธ์žฅ์œผ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

KcBERT was trained on text from 2019.01-2020.06, about 90 million sentences after cleaning.

Preprocessing

The preprocessing steps for PLM training were as follows:

  1. Korean, English, special characters, and even emoji (🥳)!

    A regular expression keeps Korean, English, special characters, and even emoji in the training data (a short illustration follows this list).

    The Korean character range is set to ㄱ-ㅎ가-힣, which excludes the Hanja that would fall within the broader ㄱ-힣 span.

  2. Collapsing repeated characters within comments

    Runs of repeated characters such as ㅋㅋㅋㅋㅋ are collapsed, e.g. to ㅋㅋ.

  3. Cased model

    KcBERT is a cased model that preserves upper/lower case for English text.

  4. Removing texts of 10 characters or fewer

    Texts shorter than 10 characters usually consist of a single word, so they were excluded.

  5. Deduplication

    To remove duplicated comments, exactly identical comments were merged into one.

  6. Removing OOO

    On some portals, profanity in comments is masked as OOO by the site's own filter; these tokens were removed (replaced with whitespace).
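To illustrate step 1, here is a small sketch of the character-range filter (the pattern mirrors the clean function below; the example string is made up):

import re

# Hanja such as 單語 falls outside the allowed ASCII / ㄱ-ㅣ가-힣 ranges,
# so the negated character class replaces each run with a space.
pattern = re.compile(r'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
print(pattern.sub(' ', '한자 單語 제거 테스트'))  # -> '한자   제거 테스트'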

After installing the packages below via pip, cleaning your text with the clean function below improves downstream-task performance (fewer [UNK] tokens).

pip install soynlp emoji

Apply the clean function below to your text data.

import re
import emoji
from soynlp.normalizer import repeat_normalize

# Allow spaces, common punctuation, ASCII, and Korean jamo/syllables;
# everything else (including Hanja) is replaced with a space.
pattern = re.compile(r'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)
    x = emoji.replace_emoji(x, replace='')  # remove emoji
    x = url_pattern.sub('', x)              # strip URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # collapse repeats, e.g. ㅋㅋㅋㅋ -> ㅋㅋ
    return x
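For example, running clean on a hypothetical raw comment (the input string and expected output are illustrative):

raw = "ㅋㅋㅋㅋㅋ 이거 보세요 https://example.com 🤣🤣"
print(clean(raw))
# -> 'ㅋㅋ 이거 보세요' (repeats collapsed, URL and emoji stripped)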

💡 The Finetune scores above were measured without applying the clean function.

Cleaned Data

  • Additional data beyond KcBERT will be released after cleanup.

Tokenizer, Model Train

The tokenizer was trained with Huggingface's Tokenizers library.

Specifically, BertWordPieceTokenizer was used, with a vocab size of 30000.

The tokenizer was trained on the full dataset. To better support general downstream tasks, the portion of KoELECTRA's vocab that did not overlap with ours was added on top. (The two models actually overlapped by only about 5000 tokens.)
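A minimal sketch of that setup with the tokenizers library (the corpus path is an assumption; the original training script is not shown in this README):

from tokenizers import BertWordPieceTokenizer

# Cased WordPiece tokenizer, matching the preprocessing described above.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["comments.txt"],  # assumed path to the full comment corpus
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt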

Training ran for about 10 days on a TPU v3-8; the model currently published on Huggingface carries the weights from 848k steps.

(Performance was evaluated at each 100k-step checkpoint; see the KcBERT-finetune repo for details.)

๋ชจ๋ธ ํ•™์Šต Loss๋Š” Step์— ๋”ฐ๋ผ ์ดˆ๊ธฐ 100-200k ์‚ฌ์ด์— ๊ธ‰๊ฒฉํžˆ Loss๊ฐ€ ์ค„์–ด๋“ค๋‹ค ํ•™์Šต ์ข…๋ฃŒ๊นŒ์ง€๋„ ์ง€์†์ ์œผ๋กœ loss๊ฐ€ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

KcELECTRA-base Pretrain Loss

Downstream Task Performance by KcELECTRA Pretrain Step

💡 The comparison below was run on only a subset of checkpoints, not every ckpt.

KcELECTRA downstream task performance by pretrain step

  • As shown above, KcELECTRA-base outperforms KcBERT-base and KcBERT-large on every dataset.
  • During KcELECTRA pretraining as well, performance improves gradually as the number of train steps grows.

Citation

When citing KcELECTRA, please use the entry below.

@misc{lee2021kcelectra,
  author = {Junbum Lee},
  title = {KcELECTRA: Korean comments ELECTRA},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Beomi/KcELECTRA}}
}

For uses other than citation in a paper, please attribute the MIT License. ☺️

Acknowledgement

The GCP/TPU environment used to train the KcELECTRA model was supported by the TFRC program.

๋ชจ๋ธ ํ•™์Šต ๊ณผ์ •์—์„œ ๋งŽ์€ ์กฐ์–ธ์„ ์ฃผ์‹  Monologg ๋‹˜ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค :)

Reference

Github Repos

Blogs
