Korean GPT-2 pretrained cased (KoGPT2)

KoGPT2 (Korean GPT-2) Ver 2.0

GPT-2 is a language model trained to predict the next word of a given text, and it is optimized for sentence generation. KoGPT2 is a Korean decoder language model trained on more than 40GB of text to overcome GPT-2's weak Korean performance.

Tokenizer

The tokenizer was trained with the Character BPE tokenizer from the tokenizers package.

The vocabulary size is 51,200. Emoticons and emoji that are common in conversation, such as the ones below, were added to the vocabulary so that the tokenizer recognizes them better.

😀, 😁, 😆, 😅, 🤣, .. , :-), :), -), (-:...

In addition, unused tokens such as <unused0> ~ <unused99> are defined so that they can be freely assigned a meaning for whatever task needs them.

> from transformers import PreTrainedTokenizerFast
> tokenizer = PreTrainedTokenizerFast.from_pretrained("skt/kogpt2-base-v2",
  bos_token='</s>', eos_token='</s>', unk_token='<unk>',
  pad_token='<pad>', mask_token='<mask>')
> tokenizer.tokenize("안녕하세요. 한국어 GPT-2 입니다.😤:)l^o")
['▁안녕', '하', '세', '요.', '▁한국어', '▁G', 'P', 'T', '-2', '▁입', '니다.', '😤', ':)', 'l^o']
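As an illustration of how the unused tokens described above might be put to work, the sketch below reserves two of them as delimiters for a question/answer fine-tuning format. The role names, the `build_qa_text` helper, and the choice of `<unused0>` and `<unused1>` are hypothetical and do not come from the README; only the `<unusedN>` naming and the `</s>` end-of-sequence token do. Before relying on this, it would be worth checking with `tokenizer.convert_tokens_to_ids` that each chosen token maps to a single vocabulary ID.

```python
# The README defines <unused0> ~ <unused99>; everything else here is an assumption.
UNUSED_TOKENS = [f"<unused{i}>" for i in range(100)]

# Hypothetical task-specific roles assigned to two of the unused tokens.
ROLE_TOKENS = {
    "question": UNUSED_TOKENS[0],
    "answer": UNUSED_TOKENS[1],
}


def build_qa_text(question, answer, eos_token="</s>"):
    """Join a question and an answer into one training string with delimiter tokens."""
    return (
        f"{ROLE_TOKENS['question']}{question}"
        f"{ROLE_TOKENS['answer']}{answer}{eos_token}"
    )


print(build_qa_text("안녕하세요", "반갑습니다"))
```

Reusing tokens that are already in the vocabulary avoids resizing the embedding matrix, which is presumably why they were reserved.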

Model

Model           # of params  Type     # of layers  # of heads  ffn_dim  hidden_dims
kogpt2-base-v2  125M         Decoder  12           12          3072     768

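The 125M figure in the table can be roughly reproduced from the other columns. The sketch below assumes the standard GPT-2 layout (learned position embeddings over 1,024 positions, biases on every linear layer, two layer norms per block plus a final one, and an output head tied to the token embedding); these details are assumptions, not something the README states. The vocabulary size of 51,200 comes from the Tokenizer section.

```python
def count_gpt2_params(vocab_size, hidden, ffn_dim, n_layers, n_positions=1024):
    """Count parameters of a GPT-2 style decoder with a tied output head."""
    token_emb = vocab_size * hidden
    pos_emb = n_positions * hidden  # n_positions=1024 is an assumption
    # Attention: fused QKV projection plus output projection, each with a bias.
    attn = (hidden * 3 * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn_dim -> hidden, each with a bias.
    mlp = (hidden * ffn_dim + ffn_dim) + (ffn_dim * hidden + hidden)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * hidden
    return token_emb + pos_emb + n_layers * per_layer + final_ln


total = count_gpt2_params(vocab_size=51200, hidden=768, ffn_dim=3072, n_layers=12)
print(f"{total:,}")  # 125,164,032
```

The number of heads does not appear in the count because splitting the hidden dimension across heads does not change the size of the projection matrices.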
> import torch
> from transformers import GPT2LMHeadModel

> model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')
> text = '근육이 커지기 위해서는'
> input_ids = tokenizer.encode(text, return_tensors='pt')
> gen_ids = model.generate(input_ids,
                           max_length=128,
                           repetition_penalty=2.0,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           bos_token_id=tokenizer.bos_token_id,
                           use_cache=True)
> generated = tokenizer.decode(gen_ids[0])
> print(generated)
근육이 커지기 위해서는 무엇보다 규칙적인 생활습관이 중요하다.
특히, 아침식사는 단백질과 비타민이 풍부한 과일과 채소를 많이 섭취하는 것이 좋다.
또한 하루 30분 이상 충분한 수면을 취하는 것도 도움이 된다.
아침 식사를 거르지 않고 규칙적으로 운동을 하면 혈액순환에 도움을 줄 뿐만 아니라 신진대사를 촉진해 체내 노폐물을 배출하고 혈압을 낮춰준다.
운동은 하루에 10분 정도만 하는 게 좋으며 운동 후에는 반드시 스트레칭을 통해 근육량을 늘리고 유연성을 높여야 한다.
운동 후 바로 잠자리에 드는 것은 피해야 하며 특히 아침에 일어나면 몸이 피곤해지기 때문에 무리하게 움직이면 오히려 역효과가 날 수도 있다...
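The generate call above passes repetition_penalty=2.0. As a rough illustration of what such a penalty does, the sketch below applies the commonly used rule to a plain list of logits: for every token that has already appeared, a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. The function name and the toy logits are hypothetical, and this is a simplified pure-Python rendering, not the library's own code.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=2.0):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Toy vocabulary of four tokens; tokens 0 and 1 were already generated.
logits = [4.0, -1.0, 2.0, 0.5]
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1]))
# [2.0, -2.0, 2.0, 0.5]
```

A penalty of 1.0 leaves the logits unchanged; larger values discourage repetition more strongly.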

Performances

Classification or Regression

Model       NSMC(acc)  KorSTS(spearman)
KoGPT2 2.0  89.1       77.8

Data

In addition to Korean Wikipedia, a variety of data was used to train the model, including news, Modu Corpus (모두의 말뭉치) v1.0, and Blue House National Petitions.

Demo

Demo link

User Contributed Examples

Related press releases

Contacts

Please post KoGPT2-related issues here.

License

KoGPT2 is released under the CC-BY-NC-SA 4.0 license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.