  • Stars: 201
  • Rank: 187,695 (Top 4%)
  • Language: Python
  • License: Apache License 2.0
  • Created over 2 years ago, updated 4 months ago

Repository Details

🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation

What is BigBird?

BigBird, introduced in BigBird: Transformers for Longer Sequences, is a sparse-attention-based model that can handle longer sequences than a standard BERT.

🦅 Longer Sequence - handles up to 4096 tokens, 8x the maximum of 512 tokens that BERT can handle

⏱️ Computational Efficiency - uses sparse attention instead of full attention, improving complexity from O(n²) to O(n)
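The sparsity behind this O(n²) to O(n) improvement combines three attention patterns: a few global blocks that attend everywhere, a sliding window around each position, and a handful of random blocks. A minimal block-level sketch in pure Python (the block counts, window size, and random count below are illustrative assumptions, not the model's actual configuration):

```python
import random

def bigbird_mask(n_blocks, window=3, n_global=2, n_random=2, seed=0):
    """Build a BigBird-style block-level attention mask: global blocks
    attend to / are attended by everything, each block attends to a
    local window, plus a few random key blocks. Returns the set of
    (query_block, key_block) pairs allowed to attend."""
    rng = random.Random(seed)
    allowed = set()
    for q in range(n_blocks):
        # Global pattern: the first n_global blocks connect to all blocks.
        for g in range(n_global):
            allowed.add((q, g))
            allowed.add((g, q))
        # Sliding window centered on the query block (includes itself).
        for k in range(max(0, q - window // 2), min(n_blocks, q + window // 2 + 1)):
            allowed.add((q, k))
        # A few random key blocks per query block.
        for k in rng.sample(range(n_blocks), n_random):
            allowed.add((q, k))
    return allowed

# The number of attended pairs grows linearly with sequence length,
# unlike full attention's quadratic n_blocks ** 2.
for n in (16, 32, 64):
    print(n, len(bigbird_mask(n)), n * n)
```

Each query block touches a roughly constant number of key blocks, so total work scales linearly in sequence length rather than quadratically.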

How to Use

  • The model uploaded to the 🤗 Huggingface Hub can be used right away :)
  • transformers>=4.11.0 is recommended, since it fixes several issues. (see the PR for the MRC issue)
  • Use BertTokenizer instead of BigBirdTokenizer (AutoTokenizer loads BertTokenizer).
  • See the BigBird Transformers documentation for details.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer

Pretraining

μžμ„Έν•œ λ‚΄μš©μ€ [Pretraining BigBird] μ°Έκ³ 

                     Hardware   Max len   LR     Batch   Train Step   Warmup Step
KoBigBird-BERT-Base  TPU v3-8   4096      1e-4   32      2M           20k
  • Trained on a variety of data, including the Modu Corpus, Korean Wikipedia, Common Crawl, and news data
  • Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC)
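The LR, Train Step, and Warmup Step figures above imply the usual linear warmup schedule; the repo does not state how the LR decays after warmup, so the linear decay below is an assumption:

```python
def lr_at(step, peak_lr=1e-4, warmup=20_000, total=2_000_000):
    """Linear warmup to peak_lr over `warmup` steps, then linear
    decay to zero at `total` steps (a common BERT-style schedule)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))

print(lr_at(10_000))     # halfway through warmup
print(lr_at(20_000))     # peak learning rate
print(lr_at(2_000_000))  # end of training
```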

Evaluation Result

1. Short Sequence (<=512)

μžμ„Έν•œ λ‚΄μš©μ€ [Finetune on Short Sequence Dataset] μ°Έκ³ 

                     NSMC    KLUE-NLI   KLUE-STS     Korquad 1.0     KLUE MRC
                     (acc)   (acc)      (pearsonr)   (em/f1)         (em/rouge-w)
KoELECTRA-Base-v3    91.13   86.87      93.14        85.66 / 93.94   59.54 / 65.64
KLUE-RoBERTa-Base    91.16   86.30      92.91        85.35 / 94.53   69.56 / 74.64
KoBigBird-BERT-Base  91.18   87.17      92.61        87.08 / 94.71   70.33 / 75.34
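The em/f1 columns are the standard extractive-QA metrics: exact match after normalization, and token-overlap F1 between the predicted and gold answer spans. A minimal sketch (whitespace tokenization for illustration; the actual KorQuAD/KLUE evaluation scripts apply their own normalization):

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the prediction matches the gold answer exactly, else 0.0."""
    return float(pred.strip() == gold.strip())

def token_f1(pred, gold):
    """Token-overlap F1 between predicted and gold answer strings."""
    p, g = pred.split(), gold.split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Sejong the Great", "Sejong the Great"))  # 1.0
print(token_f1("the Sejong the Great", "Sejong the Great"))
```

EM is all-or-nothing, while F1 gives partial credit for overlapping spans, which is why the f1 numbers in the tables are consistently higher than em.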

2. Long Sequence (>=1024)

μžμ„Έν•œ λ‚΄μš©μ€ [Finetune on Long Sequence Dataset] μ°Έκ³ 

                     TyDi QA         Korquad 2.1     Fake News   Modu Sentiment
                     (em/f1)         (em/f1)         (f1)        (f1-macro)
KLUE-RoBERTa-Base    76.80 / 78.58   55.44 / 73.02   95.20       42.61
KoBigBird-BERT-Base  79.13 / 81.30   67.77 / 82.03   98.85       45.42
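The f1-macro metric used for Modu Sentiment averages per-class F1 scores with equal weight, so a model cannot hide poor performance on rare classes behind a large majority class. A minimal sketch (labels are illustrative):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Always predicting the majority class looks fine on accuracy
# but is punished by macro F1 (the minority class scores 0).
print(macro_f1(["pos", "pos", "pos", "neg"], ["pos"] * 4))
```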

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

We also thank Seyun Ahn for the wonderful logo.

More Repositories

1. JointBERT: Pytorch implementation of JointBERT: "BERT for Joint Intent Classification and Slot Filling" (Python, 600 stars)
2. KoELECTRA: Pretrained ELECTRA Model for Korean (Python, 584 stars)
3. R-BERT: Pytorch implementation of R-BERT: "Enriching Pre-trained Language Model with Entity Information for Relation Classification" (Python, 333 stars)
4. KoBERT-Transformers: KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed) (Python, 190 stars)
5. DistilKoBERT: Distillation of KoBERT from SKTBrain (Lightweight KoBERT) (Python, 180 stars)
6. GoEmotions-pytorch: Pytorch Implementation of GoEmotions 😍😢😱 (Python, 142 stars)
7. KoBERT-NER: NER Task with KoBERT (with Naver NLP Challenge dataset) (Python, 90 stars)
8. HanBert-Transformers: HanBert on 🤗 Huggingface Transformers 🤗 (Python, 85 stars)
9. KoBERT-nsmc: Naver movie review sentiment classification with KoBERT (Python, 76 stars)
10. transformers-android-demo: 📲 Transformers android examples (Tensorflow Lite & Pytorch Mobile) (Java, 76 stars)
11. KoBERT-KorQuAD: Korean MRC (KorQuAD) with KoBERT (Python, 66 stars)
12. nlp-arxiv-daily: Automatically Update NLP Papers Daily using Github Actions (ref: https://github.com/Vincentqyw/cv-arxiv-daily) (Python, 63 stars)
13. EncT5: Pytorch Implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks (Python, 58 stars)
14. NER-Multimodal-pytorch: Pytorch Implementation of "Adaptive Co-attention Network for Named Entity Recognition in Tweets" (AAAI 2018) (Python, 56 stars)
15. KoCharELECTRA: Character-level Korean ELECTRA Model (syllable-level Korean ELECTRA) (Python, 53 stars)
16. GoEmotions-Korean: Korean version of GoEmotions Dataset 😍😢😱 (Python, 50 stars)
17. hashtag-prediction-pytorch: Multimodal Hashtag Prediction with instagram data & pytorch (2nd Place on OpenResource Hackathon 2019) (Python, 47 stars)
18. KoELECTRA-Pipeline: Transformers Pipeline with KoELECTRA (Python, 40 stars)
19. ko_lm_dataformat: A utility for storing and reading files for Korean LM training 💾 (Python, 36 stars)
20. korean-ner-pytorch: NER Task with CNN + BiLSTM + CRF (with Naver NLP Challenge dataset) with Pytorch (Python, 27 stars)
21. korean-hate-speech-koelectra: Bias, Hate classification with KoELECTRA 👿 (Python, 26 stars)
22. python-template: Python template code (Makefile, 21 stars)
23. naver-nlp-challenge-2018: NER task for Naver NLP Challenge 2018 (3rd Place) (Python, 19 stars)
24. BIO-R-BERT: R-BERT on DDI Bio dataset with BioBERT (Python, 17 stars)
25. HanBert-NER: NER Task with HanBert (with Naver NLP Challenge dataset) (Python, 16 stars)
26. kakaotrans: [Unofficial] Kakaotrans: Kakao translate API for python (Python, 15 stars)
27. py-backtrans: Python library for backtranslation (with Google Translate) (Python, 12 stars)
28. dotfiles: Simple setup for personal dotfiles (Shell, 10 stars)
29. monologg: Profile repository (9 stars)
30. kobert2transformers: KoBERT to transformers library format (Python, 7 stars)
31. ner-sample: NER Sample Code (Python, 7 stars)
32. HanBert-nsmc: Naver movie review sentiment classification with HanBert (Python, 4 stars)
33. torchserve-practice: (Python, 4 stars)
34. monologg.github.io: Personal Blog https://monologg.github.io (CSS, 3 stars)