  • Stars: 105
  • Rank: 328,196 (Top 7%)
  • Language: Python
  • License: Creative Commons Attribution-ShareAlike 4.0 International
  • Created: almost 3 years ago
  • Updated: over 1 year ago

Repository Details

Korean Sentence Embedding Repository

Korean-Sentence-Embedding

The Korean Sentence Embedding Repository provides pre-trained models that can be downloaded and used for inference right away, along with the code and environment needed to train your own models.

Quick tour

Note
All pre-trained models are uploaded to the Hugging Face Model Hub: https://huggingface.co/BM-K

import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    # Cosine similarity between (batches of) sentence embeddings, scaled to 0-100.
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta-multitask')  # or 'BM-K/KoSimCSE-bert-multitask'
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta-multitask')  # or 'BM-K/KoSimCSE-bert-multitask'

sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',   # A cheetah chases its prey across the field.
             '치타 한 마리가 먹이 뒤에서 달리고 있다.',   # A cheetah is running behind its prey.
             '원숭이 한 마리가 드럼을 연주한다.']         # A monkey is playing the drums.

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings, _ = model(**inputs, return_dict=False)  # last hidden states: (batch, seq_len, hidden_size)

# Compare the [CLS] embeddings (token position 0) of each sentence pair.
score01 = cal_score(embeddings[0][0], embeddings[1][0])  # 84.09
# '치타가 들판을 가로 질러 먹이를 쫓는다.' @ '치타 한 마리가 먹이 뒤에서 달리고 있다.'
score02 = cal_score(embeddings[0][0], embeddings[2][0])  # 23.21
# '치타가 들판을 가로 질러 먹이를 쫓는다.' @ '원숭이 한 마리가 드럼을 연주한다.'
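
Because cal_score accepts 2-D inputs, the quick-tour snippet above can also score every sentence pair in one call. The lines below continue directly from that snippet (same model, tokenizer, and embeddings):

# [CLS] embeddings of all three sentences: shape (3, hidden_size).
cls_embeddings = embeddings[:, 0]

# Full pairwise similarity matrix (cosine similarity * 100);
# the diagonal is 100 because each sentence matches itself.
score_matrix = cal_score(cls_embeddings, cls_embeddings)
print(score_matrix)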

Update history

** Updates on Mar.08.2023 **

  • Update Unsupervised Models

** Updates on Feb.24.2023 **

  • Upload KoSimCSE clustering example

** Updates on Nov.15.2022 **

  • Upload KoDiffCSE-unsupervised training code

** Updates on Oct.27.2022 **

  • Upload KoDiffCSE-unsupervised performance

** Updates on Oct.21.2022 **

  • Upload KoSimCSE-unsupervised performance

** Updates on Jun.01.2022 **

  • Release KoSimCSE-multitask models

** Updates on May.23.2022 **

  • Upload KoSentenceT5 training code
  • Upload KoSentenceT5 performance

** Updates on Mar.01.2022 **

  • Release KoSimCSE

** Updates on Feb.11.2022 **

  • Upload KoSimCSE training code
  • Upload KoSimCSE performance

** Updates on Jan.26.2022 **

  • Upload KoSBERT training code
  • Upload KoSBERT performance

Baseline Models

Baseline models used for Korean sentence embedding - KLUE-PLMs

| Model | Embedding size | Hidden size | # Layers | # Heads |
|---|---|---|---|---|
| KLUE-BERT-base | 768 | 768 | 12 | 12 |
| KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |

Warning
Large pre-trained models need a lot of GPU memory to train
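
The KLUE baselines above can be loaded directly from the Hugging Face Hub. A minimal sketch, assuming the public model IDs klue/bert-base and klue/roberta-base; since both encoders use a 768-dimensional hidden size, the [CLS] embedding below is a 768-dimensional vector:

import torch
from transformers import AutoModel, AutoTokenizer

baseline = 'klue/roberta-base'  # or 'klue/bert-base'
tokenizer = AutoTokenizer.from_pretrained(baseline)
encoder = AutoModel.from_pretrained(baseline)

# Encode one sentence and take its [CLS] embedding (inference only, no gradients).
inputs = tokenizer('한국어 문장 임베딩', return_tensors='pt')  # "Korean sentence embedding"
with torch.no_grad():
    last_hidden_state = encoder(**inputs).last_hidden_state
print(last_hidden_state[:, 0].shape)  # torch.Size([1, 768])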

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021] (see the loss sketch after this list)
  3. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models [Sentence-T5]-[ACL findings 2022]
  4. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings [DiffCSE]-[NAACL 2022]
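
KoSimCSE and KoDiffCSE above are trained with a SimCSE-style contrastive objective: each sentence is pulled toward its positive (a dropout-augmented copy in the unsupervised setting, an entailment sentence in the supervised setting) and pushed away from the other positives in the batch. The following is a minimal, illustrative sketch of that in-batch InfoNCE loss, not the repository's actual training code:

import torch
import torch.nn.functional as F

def simcse_loss(anchors, positives, temperature=0.05):
    # anchors, positives: (batch, dim) sentence embeddings.
    # Row i of the similarity matrix treats positives[i] as the correct class
    # and every other positive in the batch as an in-batch negative.
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    sim = anchors @ positives.t() / temperature   # (batch, batch) scaled cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

# Toy usage with random vectors standing in for two encoder passes of the same batch.
anchors = torch.randn(8, 768)
positives = anchors + 0.01 * torch.randn(8, 768)
print(simcse_loss(anchors, positives))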

Datasets

Setups

  • Python
  • PyTorch

KoSentenceBERT

  • 🤗 Model Training
  • Dataset (Supervised)
    • Training: snli_1.0_train.ko.tsv, sts-train.tsv (multi-task)
      • Performance can be further improved by adding multinli data to training.
    • Validation: sts-dev.tsv
    • Test: sts-test.tsv

KoSimCSE

  • 🤗 Model Training
  • Dataset (Supervised)
    • Training: snli_1.0_train.ko.tsv + multinli.train.ko.tsv (Supervised setting; see the triple-building sketch below)
    • Validation: sts-dev.tsv
    • Test: sts-test.tsv
  • Dataset (Unsupervised)
    • Training: wiki_corpus.txt
    • Validation: sts-dev.tsv
    • Test: sts-test.tsv
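
In the supervised setting, SimCSE-style training typically turns the NLI files above into (anchor, positive, hard negative) triples: the premise is the anchor, an entailment hypothesis is the positive, and a contradiction hypothesis is the hard negative. A rough sketch of building such triples from snli_1.0_train.ko.tsv, assuming tab-separated sentence1 / sentence2 / gold_label columns as in the KorNLI release (the exact column names are an assumption):

import csv
from collections import defaultdict

def load_nli_triples(path):
    # Group hypotheses by premise and label, then keep premises that have
    # both an entailment (positive) and a contradiction (hard negative).
    by_premise = defaultdict(dict)
    with open(path, encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            by_premise[row['sentence1']][row['gold_label']] = row['sentence2']

    triples = []
    for premise, hyps in by_premise.items():
        if 'entailment' in hyps and 'contradiction' in hyps:
            triples.append((premise, hyps['entailment'], hyps['contradiction']))
    return triples

triples = load_nli_triples('snli_1.0_train.ko.tsv')
print(len(triples), triples[0] if triples else None)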

KoSentenceT5

  • 🤗 Model Training
  • Dataset (Supervised)
    • Training: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
    • Validation: sts-dev.tsv
    • Test: sts-test.tsv

KoDiffCSE

  • 🤗 Model Training
  • Dataset (Unsupervised)
    • Training: wiki_corpus.txt
    • Validation: sts-dev.tsv
    • Test: sts-test.tsv (see the STS evaluation sketch below)
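
Every configuration above validates and tests on KorSTS (sts-dev.tsv / sts-test.tsv), and the performance tables below report Pearson and Spearman correlations (x100) between model similarity scores and the 0-5 gold scores. A minimal sketch of that evaluation, assuming tab-separated score / sentence1 / sentence2 columns (the column names are an assumption) and a caller-supplied encode function that maps a list of sentences to an (n, dim) embedding tensor, e.g. a wrapper around the quick-tour model:

import csv
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(path, encode):
    # Read gold scores and sentence pairs from the STS TSV file.
    gold, sents1, sents2 = [], [], []
    with open(path, encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            gold.append(float(row['score']))
            sents1.append(row['sentence1'])
            sents2.append(row['sentence2'])

    # Cosine similarity of each pair, then correlation with the gold scores.
    e1 = F.normalize(encode(sents1), dim=-1)
    e2 = F.normalize(encode(sents2), dim=-1)
    cosine = (e1 * e2).sum(dim=-1).tolist()
    return {'pearson': pearsonr(cosine, gold)[0] * 100,
            'spearman': spearmanr(cosine, gold)[0] * 100}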

Performance-supervised

| Model | Average | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| KoSBERT†SKT | 77.40 | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT | 80.39 | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa | 81.64 | 81.20 | 82.20 | 81.79 | 82.34 | 81.59 | 82.20 | 80.62 | 81.25 |
| KoSentenceBART | 77.14 | 79.71 | 78.74 | 78.42 | 78.02 | 78.40 | 78.00 | 74.24 | 72.15 |
| KoSentenceT5 | 77.83 | 80.87 | 79.74 | 80.24 | 79.36 | 80.19 | 79.27 | 72.81 | 70.17 |
| KoSimCSE-BERT†SKT | 81.32 | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT | 83.37 | 83.22 | 83.58 | 83.24 | 83.60 | 83.15 | 83.54 | 83.13 | 83.49 |
| KoSimCSE-RoBERTa | 83.65 | 83.60 | 83.77 | 83.54 | 83.76 | 83.55 | 83.77 | 83.55 | 83.64 |
| KoSimCSE-BERT-multitask | 85.71 | 85.29 | 86.02 | 85.63 | 86.01 | 85.57 | 85.97 | 85.26 | 85.93 |
| KoSimCSE-RoBERTa-multitask | 85.77 | 85.08 | 86.12 | 85.84 | 86.12 | 85.83 | 86.12 | 85.03 | 85.99 |

Performance-unsupervised

| Model | Average | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| KoSRoBERTa-base† | N/A | N/A | 48.96 | N/A | N/A | N/A | N/A | N/A | N/A |
| KoSRoBERTa-large† | N/A | N/A | 51.35 | N/A | N/A | N/A | N/A | N/A | N/A |
| KoSimCSE-BERT | 74.08 | 74.92 | 73.98 | 74.15 | 74.22 | 74.07 | 74.07 | 74.15 | 73.14 |
| KoSimCSE-RoBERTa | 75.27 | 75.93 | 75.00 | 75.28 | 75.01 | 75.17 | 74.83 | 75.95 | 75.01 |
| KoDiffCSE-RoBERTa | 77.17 | 77.73 | 76.96 | 77.21 | 76.89 | 77.11 | 76.81 | 77.74 | 76.97 |

Downstream tasks

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}

@inproceedings{chuang2022diffcse,
   title={{DiffCSE}: Difference-based Contrastive Learning for Sentence Embeddings},
   author={Chuang, Yung-Sung and Dangovski, Rumen and Luo, Hongyin and Zhang, Yang and Chang, Shiyu and Soljacic, Marin and Li, Shang-Wen and Yih, Wen-tau and Kim, Yoon and Glass, James},
   booktitle={Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
   year={2022}
}

More Repositories

  1. KoSentenceBERT-ETRI: Sentence Embeddings using Siamese ETRI KoBERT-Networks (Python, 147 stars)
  2. KoSentenceBERT-SKT: Sentence Embeddings using Siamese SKT KoBERT-Networks (Python, 99 stars)
  3. KoSimCSE-SKT: Simple Contrastive Learning of Korean Sentence Embeddings (Python, 41 stars)
  4. Styling-Chatbot-with-Transformer: A model that varies chatbot responses according to language style and emotion (Python, 32 stars)
  5. KoMiniLM: Korean Light Weight Language Model (Python, 29 stars)
  6. KoDiffCSE: Difference-based Contrastive Learning for Korean Sentence Embeddings (Python, 20 stars)
  7. Dialogue-Generation-BERT-GPT2-Korean (Python, 16 stars)
  8. Troll-Detector: Troll Detector (Python, 14 stars)
  9. KoBART-summarization-pytorch: 🧀 KoBART summarization using pytorch (Python, 12 stars)
  10. KoChatBART: Korean Chatting BART (Jupyter Notebook, 11 stars)
  11. Analyzing-Product-Review-System-with-BERT (Python, 10 stars)
  12. Dialogue-Generation-Model-Evaluation: Automatic Evaluation Code for Measuring Dialogue Generation Model Performance (Python, 9 stars)
  13. KoSentenceBERT_V2: Performance improvements from changes to the KoSentenceBERT model architecture (CSS, 9 stars)
  14. Dialogue-Generation-BERT-GPT2-English (Python, 8 stars)
  15. -Personal-study-Deep_Mutual_Learning (Python, 8 stars)
  16. Paper-Seminar: Paper Seminar (7 stars)
  17. Knowledge-Distillation-Experiments (Python, 7 stars)
  18. KoIR: Korean Information Retrieval (Python, 7 stars)
  19. Transformer-Implementation (Python, 6 stars)
  20. TF-IDF-with-ArchDaily (Python, 6 stars)
  21. -Personal-study-s2s-transformer: -Personal-study-transformer-classification (Python, 5 stars)
  22. Question-Difficulty-Estimation: Question Difficulty Estimation (Python, 5 stars)
  23. WiseReporter (Python, 5 stars)
  24. Simple-NER-Implementation: Korean named entity recognizer (BERT based Named Entity Recognition model for Korean) (Python, 5 stars)
  25. algorithm: algorithm (Python, 4 stars)
  26. distinct-N (Python, 4 stars)
  27. -Personal-study-prac-kor-embedding (Python, 4 stars)
  28. T-SSKD (Python, 4 stars)
  29. My_web: for study (Python, 3 stars)
  30. BertForMaskedLM-Performance: BERT MLM performance check (Python, 3 stars)
  31. CBCL: Continual Learning (Python, 3 stars)
  32. Vocab (Python, 3 stars)
  33. Retrieve-and-Refine (3 stars)
  34. Response-Aware-Hybrid-Response-Generator (Python, 2 stars)
  35. Response-Aware-Candidate-Retrieval: Code for the IP&M paper "A Hybrid Response Generation by Response-Aware Candidate Retrieval and Seq-to-seq Generation" (Python, 2 stars)
  36. BM-K (2 stars)
  37. CoNKT: Contrastive Neural Korean Text Generation (1 star)