Ko-Sentence-BERT
Korean SentenceBERT: Sentence Embeddings using Siamese ETRI KoBERT-Networks
Note
For other sentence embedding models and their results, please refer to the following link.
[Sentence-Embedding-Is-All-You-Need]
Installation
- ETRI KorBERT only works with transformers 2.4.1 ~ 2.8.0, while Sentence-BERT requires transformers 3.1.0 or higher, so the libraries were modified to make the two compatible.
- Because the huggingface transformers, sentence-transformers, and tokenizers library code is patched directly, using a virtual environment is recommended.
- The Docker image used is available on Docker Hub.
- The models were trained with ETRI KoBERT; this repository does not distribute ETRI KoBERT itself.
- A version that uses SKT KoBERT is available in the following repository.
git clone https://github.com/BM-K/KoSentenceBERT.git
python -m venv .KoSBERT
. .KoSBERT/bin/activate
pip install -r requirements.txt
- Move the transformers, tokenizers, and sentence_transformers directories into .KoSBERT/lib/python3.7/site-packages/.
- The ETRI_KoBERT model and tokenizer must be present inside the KoSentenceBERT directory.
- The ETRI model and tokenizer are loaded as in the following example:
# Inside the patched sentence_transformers model code: load the ETRI KoBERT checkpoint
# and its eojeol tokenizer from the local checkpoint directory.
from transformers import BertModel
from ETRI_tok.tokenization_etri_eojeol import BertTokenizer

self.auto_model = BertModel.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch')
self.tokenizer = BertTokenizer.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch/vocab.txt', do_lower_case=False)
Train Models
- To train the models, the KorNLUDatasets directory must be present inside the KoSentenceBERT directory.
- For STS training, the data was adjusted to fit the model structure; the data and the training commands are shown below, followed by a sketch of the fine-tuning stage:
KoSentenceBERT/KorNLUDatasets/KorSTS/tune_test.tsv
A sample of the STS test dataset
python training_nli.py          # Train on NLI data only
python training_sts.py          # Train on STS data only
python con_training_sts.py      # Train on NLI data, then fine-tune on STS data
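The sketch below illustrates what the final stage (con_training_sts.py) does in terms of the stock sentence-transformers API: continue from the NLI-trained model and fine-tune it on KorSTS sentence pairs with a cosine-similarity regression loss. This is an assumed outline, not the repository's exact code; the single example pair and hyperparameters are illustrative only.

# Minimal sketch of the NLI -> STS fine-tuning stage (assumed; see con_training_sts.py
# for the actual implementation). Hyperparameters are illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the model that was already trained on KorNLI.
model = SentenceTransformer('./output/training_nli_ETRI_KoBERT-003_bert_eojeol')

# Each KorSTS row gives two sentences and a 0..5 similarity score, rescaled to 0..1.
train_examples = [
    InputExample(texts=['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.'], label=3.2 / 5.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4,
          warmup_steps=100,
          output_path='./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')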
Pre-Trained Models
The pooling mode uses the MEAN strategy, and trained models are saved to the output directory.
Directory | Training method |
---|---|
training_nli_ETRI_KoBERT-003_bert_eojeol | Trained on NLI only |
training_sts_ETRI_KoBERT-003_bert_eojeol | Trained on STS only |
training_nli_sts_ETRI_KoBERT-003_bert_eojeol | Trained on NLI, then fine-tuned on STS |
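For reference, the sketch below shows how a MEAN-pooling SBERT model is typically assembled from the stock sentence-transformers modules. This is an assumption about the setup, since the repository's scripts additionally wire in the patched ETRI tokenizer shown in the Installation section.

# Assumed construction of a MEAN-pooling model (the repository's scripts may differ).
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('./ETRI_KoBERT/003_bert_eojeol_pytorch', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,   # MEAN strategy
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])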
Performance
- Fixed random seed, evaluated on the KorSTS test set.
Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
---|---|---|---|---|---|---|---|---|
NLI | 67.96 | 70.45 | 71.06 | 70.48 | 71.17 | 70.51 | 64.87 | 63.04 |
STS | 80.43 | 79.99 | 78.18 | 78.03 | 78.13 | 77.99 | 73.73 | 73.40 |
STS + NLI | 80.10 | 80.42 | 79.14 | 79.28 | 79.08 | 79.22 | 74.46 | 74.16 |
- Performance comparison with other models [KLUE-PLMs].
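The correlations above are the usual SentenceBERT STS metrics: the cosine (or Euclidean / Manhattan / dot-product) score between the two embeddings of each test pair is correlated with the gold similarity label. Below is a minimal sketch of the cosine columns, assuming the KorSTS sts-test.tsv layout with score, sentence1, and sentence2 columns; the repository's evaluation code may differ.

# Sketch of computing Cosine Pearson / Spearman on the KorSTS test set (assumed file layout).
import csv
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')

sents1, sents2, gold = [], [], []
with open('./KorNLUDatasets/KorSTS/sts-test.tsv', encoding='utf-8', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
        sents1.append(row['sentence1'])
        sents2.append(row['sentence2'])
        gold.append(float(row['score']))

emb1 = model.encode(sents1, convert_to_tensor=True)
emb2 = model.encode(sents2, convert_to_tensor=True)
cosine = util.pytorch_cos_sim(emb1, emb2).diagonal().cpu().numpy()

print('Cosine Pearson : %.2f' % (pearsonr(cosine, gold)[0] * 100))
print('Cosine Spearman: %.2f' % (spearmanr(cosine, gold)[0] * 100))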
Application Examples
- Below are a few examples of how the resulting sentence embeddings can be used in downstream applications.
- The examples use the STS + NLI pretrained model.
Semantic Search
SemanticSearch.py is an example of finding the sentences in a corpus that are most similar to a given query sentence.
First, an embedding is generated for every sentence in the corpus.
from sentence_transformers import SentenceTransformer, util
import numpy as np
model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'
embedder = SentenceTransformer(model_path)
# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
The output is as follows:
======================

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.7557)
한 남자가 빵 한 조각을 먹는다. (Score: 0.6464)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.2565)
한 남자가 말을 탄다. (Score: 0.2333)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1792)

======================

Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6732)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.3401)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1037)
한 남자가 음식을 먹는다. (Score: 0.0617)
그 여자가 아이를 돌본다. (Score: 0.0466)

======================

Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7164)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3216)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.2071)
한 남자가 빵 한 조각을 먹는다. (Score: 0.1089)
한 남자가 음식을 먹는다. (Score: 0.0724)
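The same ranking can be produced more compactly with the util.semantic_search helper, provided the bundled sentence_transformers version includes it. The snippet below is a sketch using the same model path as above and a shortened corpus for brevity.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')

corpus = ['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '그 여자가 아이를 돌본다.']
queries = ['한 남자가 파스타를 먹는다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
query_embeddings = embedder.encode(queries, convert_to_tensor=True)

# util.semantic_search returns, per query, a list of {'corpus_id': ..., 'score': ...} dicts
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=3)
for query, query_hits in zip(queries, hits):
    print("Query:", query)
    for hit in query_hits:
        print(corpus[hit['corpus_id']], "(Score: %.4f)" % hit['score'])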
Clustering
Clustering.py shows an example of clustering similar sentences based on the similarity of their sentence embeddings.
As before, an embedding is first computed for each sentence.
from sentence_transformers import SentenceTransformer, util
import numpy as np
model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'
embedder = SentenceTransformer(model_path)
# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']
corpus_embeddings = embedder.encode(corpus)
# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
The output is as follows:
Cluster 1
['두 남자가 수레를 숲 속으로 밀었다.', '치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster 2
['한 남자가 말을 탄다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster 3
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster 4
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']

Cluster 5
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']
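When the number of clusters is not known in advance, hierarchical clustering with a distance threshold is a common alternative to k-means. The sketch below continues from the corpus and corpus_embeddings variables computed above; the threshold value is illustrative and would need tuning.

# Agglomerative clustering on normalized embeddings; continues from the snippet above.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.asarray(corpus_embeddings)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
clustering_model.fit(embeddings)

for sentence, label in zip(corpus, clustering_model.labels_):
    print(label, sentence)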
Downstream Tasks Demo
Citing
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}
@article{reimers-2020-multilingual-sentence-bert,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
author = "Reimers, Nils and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2004.09813",
month = "04",
year = "2020",
url = "http://arxiv.org/abs/2004.09813",
}
@article{ham2020kornli,
title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
journal={arXiv preprint arXiv:2004.03289},
year={2020}
}