
KoBERT


Korean BERT pre-trained cased (KoBERT)

Why'?'

Training Environment

  • Architecture (a build sketch follows this list)
predefined_args = {
        'attention_cell': 'multi_head',
        'num_layers': 12,
        'units': 768,
        'hidden_size': 3072,
        'max_length': 512,
        'num_heads': 12,
        'scaled': True,
        'dropout': 0.1,
        'use_residual': True,
        'embed_size': 768,
        'embed_dropout': 0.1,
        'token_type_vocab_size': 2,
        'word_embed': None,
    }
  • Training data
    Data          Sentences   Words
    Korean Wiki   5M          54M
  • Training environment
    • V100 GPU x 32, Horovod (with InfiniBand)

TensorBoard log (2019-04-29)

  • Vocabulary
    • Size: 8,002
    • Tokenizer (SentencePiece) trained on the Korean Wiki corpus
    • Fewer parameters than multilingual BERT (92M < 110M)
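
The predefined_args above describe a standard BERT-base configuration (12 layers, 12 heads, 768 hidden units, 3072 feed-forward units). Below is a minimal sketch of how such arguments could be turned into a model with gluonnlp's BERT classes, assuming gluonnlp 0.8.x; constructor signatures vary across releases, so treat this as illustrative rather than the repository's own loader code.

    import gluonnlp as nlp

    # Illustrative sketch only: build a BERT-base style encoder from the
    # architecture arguments listed above (gluonnlp 0.8.x assumed).
    encoder = nlp.model.BERTEncoder(
        num_layers=12,      # transformer blocks
        units=768,          # hidden width
        hidden_size=3072,   # feed-forward (intermediate) size
        max_length=512,     # maximum sequence length
        num_heads=12,       # attention heads
        dropout=0.1,
    )

    # Wrap the encoder with word/segment embeddings and the pooler.
    model = nlp.model.BERTModel(
        encoder,
        vocab_size=8002,            # KoBERT vocabulary size
        token_type_vocab_size=2,    # sentence A/B segment embeddings
        units=768,
        embed_size=768,
        embed_dropout=0.1,
        use_pooler=True,
        use_decoder=False,
        use_classifier=False,
    )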

Requirements

How to install

  • Install KoBERT as a Python package

    pip install git+https://github.com/SKTBrain/KoBERT.git@master
  • If you want to modify the source code, clone this repository

    git clone https://github.com/SKTBrain/KoBERT.git
    cd KoBERT
    pip install -r requirements.txt

How to use

Using with PyTorch

If you prefer the Hugging Face transformers API, see here.

>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab  = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)

The model is returned in eval() mode by default, so switch it to training mode with model.train() before fine-tuning.

  • Naver Sentiment Analysis fine-tuning with PyTorch
    • In Colab, we recommend enabling the GPU hardware accelerator via [Runtime] - [Change runtime type].
    • Open In Colab
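
For the fine-tuning notebook above, the core idea is to put a small classification head on top of KoBERT's pooled output. The sketch below is a hypothetical minimal version (the notebook itself uses its own classifier class, data loader, and padding logic):

    import torch
    from torch import nn
    from kobert import get_pytorch_kobert_model

    # Hypothetical sketch of a binary sentiment classifier on top of KoBERT.
    class KoBERTSentimentClassifier(nn.Module):
        def __init__(self, num_classes=2, dropout=0.1):
            super().__init__()
            self.bert, self.vocab = get_pytorch_kobert_model()
            self.dropout = nn.Dropout(dropout)
            self.classifier = nn.Linear(768, num_classes)  # pooled output is 768-dim

        def forward(self, input_ids, attention_mask, token_type_ids):
            _, pooled_output = self.bert(input_ids, attention_mask, token_type_ids)
            return self.classifier(self.dropout(pooled_output))

    model = KoBERTSentimentClassifier()
    model.train()  # KoBERT is returned in eval() mode; switch to train mode for fine-tuning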

Using with ONNX

>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
...                             'token_type_ids':np.array(token_type_ids),
...                             'input_mask':np.array(input_mask),
...                             'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452,  0.24282141,  0.25895312, ..., -0.48613444,
        -0.07305173,  0.07560554],
       [-0.24783179,  0.24200465,  0.25520486, ..., -0.4877185 ,
        -0.0727044 ,  0.07536091],
       [-0.24721591,  0.24196623,  0.2560626 , ..., -0.48743123,
        -0.07326943,  0.07650235]], dtype=float32)

The ONNX conversion was contributed by soeque1.

Using with MXNet-Gluon

>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248
   0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523
   0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032
   0.07650219]]
<NDArray 3x768 @cpu(0)>
  • Naver Sentiment Analysis Fine-Tuning with MXNet
    • Open In Colab

Tokenizer

>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp  = SentencepieceTokenizer(tok_path)
>>> sp('ํ•œ๊ตญ์–ด ๋ชจ๋ธ์„ ๊ณต์œ ํ•ฉ๋‹ˆ๋‹ค.')
['โ–ํ•œ๊ตญ', '์–ด', 'โ–๋ชจ๋ธ', '์„', 'โ–๊ณต์œ ', 'ํ•ฉ๋‹ˆ๋‹ค', '.']
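
To feed tokenized text into the model, the pieces still have to be mapped to vocabulary ids. Below is a minimal sketch using the gluonnlp vocabulary returned together with the PyTorch model; it is illustrative only, and the fine-tuning notebooks instead use gluonnlp's BERTSentenceTransform, which also handles padding and valid lengths.

    import torch
    from gluonnlp.data import SentencepieceTokenizer
    from kobert import get_pytorch_kobert_model, get_tokenizer

    model, vocab = get_pytorch_kobert_model()
    sp = SentencepieceTokenizer(get_tokenizer())

    # Tokenize, add the [CLS]/[SEP] special tokens, then map tokens to ids.
    tokens = ['[CLS]'] + sp('한국어 모델을 공유합니다.') + ['[SEP]']
    ids = [vocab.token_to_idx[tok] for tok in tokens]

    input_ids = torch.LongTensor([ids])
    attention_mask = torch.ones_like(input_ids)    # no padding in this single-sentence batch
    token_type_ids = torch.zeros_like(input_ids)   # single-segment input
    sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)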

Subtasks

Naver Sentiment Analysis

Model                          Accuracy
BERT base multilingual cased   0.875
KoBERT                         0.901
KoGPT2                         0.899

Korean named entity recognizer built with KoBERT and CRF

๋ฌธ์žฅ์„ ์ž…๋ ฅํ•˜์„ธ์š”:  SKTBrain์—์„œ KoBERT ๋ชจ๋ธ์„ ๊ณต๊ฐœํ•ด์ค€ ๋•๋ถ„์— BERT-CRF ๊ธฐ๋ฐ˜ ๊ฐ์ฒด๋ช…์ธ์‹๊ธฐ๋ฅผ ์‰ฝ๊ฒŒ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.
len: 40, input_token:['[CLS]', 'โ–SK', 'T', 'B', 'ra', 'in', '์—์„œ', 'โ–K', 'o', 'B', 'ER', 'T', 'โ–๋ชจ๋ธ', '์„', 'โ–๊ณต๊ฐœ', 'ํ•ด', '์ค€', 'โ–๋•๋ถ„์—', 'โ–B', 'ER', 'T', '-', 'C', 'R', 'F', 'โ–๊ธฐ๋ฐ˜', 'โ–', '๊ฐ', '์ฒด', '๋ช…', '์ธ', '์‹', '๊ธฐ๋ฅผ', 'โ–์‰ฝ๊ฒŒ', 'โ–๊ฐœ๋ฐœ', 'ํ• ', 'โ–์ˆ˜', 'โ–์žˆ์—ˆ๋‹ค', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>์—์„œ <KoBERT:POH> ๋ชจ๋ธ์„ ๊ณต๊ฐœํ•ด์ค€ ๋•๋ถ„์— <BERT-CRF:POH> ๊ธฐ๋ฐ˜ ๊ฐ์ฒด๋ช…์ธ์‹๊ธฐ๋ฅผ ์‰ฝ๊ฒŒ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.[SEP]

Korean Sentence BERT

Model       Cosine Pearson   Cosine Spearman   Euclidean Pearson   Euclidean Spearman   Manhattan Pearson   Manhattan Spearman   Dot Pearson   Dot Spearman
NLI         65.05            68.48             68.81               68.18                68.90               68.20                65.22         66.81
STS         80.42            79.64             77.93               77.43                77.92               77.44                76.56         75.83
STS + NLI   78.81            78.47             77.68               77.78                77.71               77.83                75.75         75.22

Release

  • v0.2.3
    • Support onnx 1.8.0
  • v0.2.2
    • Fix "No module named 'kobert.utils'"
  • v0.2.1
    • Document the default import statements
  • v0.2
    • Download large files from AWS S3
    • Rename functions
  • v0.1.2
    • Guarantee compatibility with higher versions of transformers
    • Fix the pad token index id
  • v0.1.1
    • Integrate the vocabulary and the tokenizer
  • v0.1
    • Initial model release

Contacts

Please file KoBERT-related issues here.

License

KoBERT is released under the Apache-2.0 license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.