KoBERT
Korean BERT pre-trained cased (KoBERT)
Why?
- Limitations of Google BERT base multilingual cased on Korean
Training Environment
- Architecture
predefined_args = {
    'attention_cell': 'multi_head',
    'num_layers': 12,
    'units': 768,
    'hidden_size': 3072,
    'max_length': 512,
    'num_heads': 12,
    'scaled': True,
    'dropout': 0.1,
    'use_residual': True,
    'embed_size': 768,
    'embed_dropout': 0.1,
    'token_type_vocab_size': 2,
    'word_embed': None,
}
- Training set

Data | Sentences | Words |
---|---|---|
Korean Wiki | 5M | 54M |
- Training environment: V100 GPU x 32, Horovod (with InfiniBand)
- Vocabulary size: 8,002
- Tokenizer: SentencePiece, trained on the Korean Wiki corpus
- Fewer parameters: 92M vs. 110M (a rough count is sketched below)
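The parameter figure can be roughly reproduced from predefined_args above. The sketch below is only a back-of-the-envelope count (it ignores biases and LayerNorm weights), not the exact number reported by the framework:

# Rough KoBERT parameter count derived from the architecture above.
vocab_size, hidden, ffn, layers, max_len = 8002, 768, 3072, 12, 512
embeddings = (vocab_size + max_len + 2) * hidden    # token + position + segment embeddings
per_layer = 4 * hidden * hidden + 2 * hidden * ffn  # attention projections + feed-forward
pooler = hidden * hidden
total = embeddings + layers * per_layer + pooler
print(f'~{total / 1e6:.0f}M parameters')            # roughly 92M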
Requirements
- see requirements.txt
How to install
- Install KoBERT as a Python package:

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master
- If you want to modify the source code, clone this repository:

git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
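To check the installation, the helper functions can be imported directly (a minimal check; the first call downloads the SentencePiece model to a local cache):

>>> from kobert import get_tokenizer
>>> get_tokenizer()  # returns the local path of the cached tokenizer model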
How to use
Using with PyTorch
If you prefer the Huggingface transformers API, see here.
>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461, 0.2428, 0.2590, ..., -0.4861, -0.0731, 0.0756],
[-0.2478, 0.2420, 0.2552, ..., -0.4877, -0.0727, 0.0754],
[-0.2472, 0.2420, 0.2561, ..., -0.4874, -0.0733, 0.0765]],
grad_fn=<SelectBackward>)
The model is returned in eval() mode by default, so if you want to use it for training, switch it to training mode with model.train().
- Naver Sentiment Analysis Fine-Tuning with pytorch
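The linked notebook walks through the full fine-tuning pipeline; the snippet below is only a minimal, hypothetical sketch of putting a classification head on the pooled output and switching to training mode (SentenceClassifier and num_classes are illustrative names, not part of the KoBERT package):

import torch
from kobert import get_pytorch_kobert_model

# Hypothetical classification head on top of KoBERT's pooled output.
class SentenceClassifier(torch.nn.Module):
    def __init__(self, bert, num_classes=2):
        super().__init__()
        self.bert = bert
        self.classifier = torch.nn.Linear(768, num_classes)

    def forward(self, input_ids, input_mask, token_type_ids):
        _, pooled_output = self.bert(input_ids, input_mask, token_type_ids)
        return self.classifier(pooled_output)

bert, vocab = get_pytorch_kobert_model()
model = SentenceClassifier(bert)
model.train()  # KoBERT is returned in eval() mode, so switch before fine-tuning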
Using with ONNX
>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
>>> 'token_type_ids':np.array(token_type_ids),
>>> 'input_mask':np.array(input_mask),
>>> 'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452, 0.24282141, 0.25895312, ..., -0.48613444,
-0.07305173, 0.07560554],
[-0.24783179, 0.24200465, 0.25520486, ..., -0.4877185 ,
-0.0727044 , 0.07536091],
[-0.24721591, 0.24196623, 0.2560626 , ..., -0.48743123,
-0.07326943, 0.07650235]], dtype=float32)
Thanks to soeque1 for help with the ONNX conversion.
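If the PyTorch example above has been run in the same session, the two backends can be compared as a quick sanity check (a sketch, not part of the package; small numerical differences are expected):

>>> np.allclose(pred_onnx[-2], sequence_output.detach().numpy(), atol=1e-4)  # expected to be True within this tolerance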
Using with MXNet-Gluon
>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372 0.24282135 0.2589539 ... -0.48613444 -0.07305248
0.07560539]
[-0.24783105 0.242005 0.25520545 ... -0.48771808 -0.07270523
0.07536077]
[-0.24721491 0.241966 0.25606337 ... -0.48743105 -0.07327032
0.07650219]]
<NDArray 3x768 @cpu(0)>
Tokenizer
- Pretrained Sentencepiece tokenizer
>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']
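To feed this into the model, the tokens can be mapped to ids with the vocabulary returned by get_pytorch_kobert_model (a minimal sketch; to_indices is the gluonnlp Vocab lookup, used here as one option):

>>> from kobert import get_pytorch_kobert_model
>>> _, vocab = get_pytorch_kobert_model()
>>> tokens = ['[CLS]'] + sp('한국어 모델을 공유합니다.') + ['[SEP]']
>>> vocab.to_indices(tokens)  # list of input ids for the model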
Subtasks
Naver Sentiment Analysis
- Dataset: https://github.com/e9t/nsmc
Model | Accuracy |
---|---|
BERT base multilingual cased | 0.875 |
KoBERT | 0.901 |
KoGPT2 | 0.899 |
Korean Named Entity Recognition with KoBERT and CRF
문장을 입력하세요: SKTBrain에서 KoBERT 모델을 공개해준 덕분에 BERT-CRF 기반 개체명인식기를 쉽게 개발할 수 있었다.
len: 40, input_token:['[CLS]', '▁SK', 'T', 'B', 'ra', 'in', '에서', '▁K', 'o', 'B', 'ER', 'T', '▁모델', '을', '▁공개', '해', '준', '▁덕분에', '▁B', 'ER', 'T', '-', 'C', 'R', 'F', '▁기반', '▁', '개', '체', '명', '인', '식', '기를', '▁쉽게', '▁개발', '할', '▁수', '▁있었다', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>에서 <KoBERT:POH> 모델을 공개해준 덕분에 <BERT-CRF:POH> 기반 개체명인식기를 쉽게 개발할 수 있었다.[SEP]
Korean Sentence BERT
Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
---|---|---|---|---|---|---|---|---|
NLI | 65.05 | 68.48 | 68.81 | 68.18 | 68.90 | 68.20 | 65.22 | 66.81 |
STS | 80.42 | 79.64 | 77.93 | 77.43 | 77.92 | 77.44 | 76.56 | 75.83 |
STS + NLI | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
Release
- v0.2.3: support onnx 1.8.0
- v0.2.2: fix No module named 'kobert.utils'
- v0.2.1: guide default 'import statements'
- v0.2: download large files from aws s3; rename functions
- v0.1.2: guaranteed compatibility with higher versions of transformers; fix pad token index id
- v0.1.1: integrated the vocabulary and the tokenizer
- v0.1: initial model release
Contacts
Please register KoBERT-related issues here.
License
KoBERT is released under the Apache-2.0 license. Please comply with the license terms when using the model and code. The full license text can be found in the LICENSE file.