

Repository Details

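text2vec, text to vector. A text vector representation tool that converts text into vector matrices. It implements text representation and text similarity models including Word2Vec, RankBM25, Sentence-BERT, and CoSENT, and works out of the box.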



Text2vec: Text to Vector


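Text2vec: Text to Vector, Get Sentence Embeddings. Text vectorization: represent text (words, sentences, and paragraphs) as vector matrices.

text2vec implements several text representation and text similarity models, including Word2Vec, RankBM25, BERT, Sentence-BERT, and CoSENT, and compares their performance on the text semantic matching (similarity computation) task.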

News

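[2023/07/17] v1.2.2: Added multi-GPU training and released the multilingual matching model shibing624/text2vec-base-multilingual. It is trained with the CoSENT method on top of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, using the manually curated multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset, and improves on the original model on the Chinese and English test sets. See Release-v1.2.2.

[2023/06/19] v1.2.1: Replaced the Chinese matching model shibing624/text2vec-base-chinese-nli with the new shibing624/text2vec-base-chinese-sentence. Because the CoSENT loss is sensitive to ranking, a high-quality STS dataset with relevance ordering, shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, was manually selected and curated; the new model improves on the previous one on every evaluation set. Also released shibing624/text2vec-base-chinese-paraphrase, a Chinese matching model suited to s2p. See Release-v1.2.1.

[2023/06/15] v1.2.0: Released the Chinese matching model shibing624/text2vec-base-chinese-nli, a CoSENT text matching model based on nghuyong/ernie-3.0-base-zh and trained on the full Chinese NLI dataset shibing624/nli_zh. It improves markedly on every evaluation set. See Release-v1.2.0.

[2022/03/12] v1.1.4: Released the Chinese matching model shibing624/text2vec-base-chinese, a CoSENT matching model trained on the Chinese STS training set. See Release-v1.1.4.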

Guide

Features

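Text vector representation models

  • Word2Vec: word vector retrieval backed by the large-scale, high-quality Chinese word embeddings open-sourced by Tencent AI Lab (the light version with 8 million Chinese words; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project implements a word2vec sentence representation by averaging word vectors.
  • SBERT (Sentence-BERT): a sentence embedding model that balances accuracy and efficiency. Training is supervised, using BERT with a softmax classification function; at prediction time, text matching takes the cosine of the sentence vectors directly. This project reproduces Sentence-BERT training and prediction in PyTorch.
  • CoSENT (Cosine Sentence): CoSENT proposes a ranking loss function that brings training closer to prediction; it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch.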
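
To make the ranking idea behind CoSENT concrete, here is a minimal sketch in plain Python. It assumes the commonly cited form of the loss, log(1 + sum of exp(scale * (cos_j - cos_i))) over every ordering where pair i has a higher label than pair j. The function name cosent_loss and the scale value 20.0 are illustrative assumptions and are not taken from this project's code.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of sentence pairs (illustrative).

    cos_scores[i] is the predicted cosine similarity of pair i and
    labels[i] is its gold similarity label. `scale` is an assumed value.
    """
    # exp(0) = 1 supplies the "1 +" inside log(1 + sum(exp(...))).
    terms = [0.0]
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                # Pair i should score higher than pair j; penalise the opposite.
                terms.append(scale * (cos_scores[j] - cos_scores[i]))
    # Numerically stable log-sum-exp.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


well_ranked = cosent_loss([0.9, 0.1], [1, 0])
badly_ranked = cosent_loss([0.1, 0.9], [1, 0])
print(well_ranked < badly_ranked)
```

The loss only looks at the relative order of the cosine scores, which is why the same function can be trained on 0/1 labels or on graded scores.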

For details on text vector representation methods, see the wiki: Text vector representation methods

Evaluation

Text matching

Evaluation results on the English matching dataset:

| Arch | BaseModel | Model | English-STS-B |
| --- | --- | --- | --- |
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 80.12 |

Evaluation results on the Chinese matching datasets:

| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 79.27 | 68.01 |
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 79.30 | 68.53 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 79.96 | 68.61 |

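Notes:

  • Evaluation metric: Spearman correlation coefficient
  • To measure model capability, every result is trained only on the train split of that dataset and evaluated on its test split; no external data is used
  • The SBERT-macbert-base model is trained with the SBERT method; run examples/training_sup_text_matching_model.py to train it
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 is trained with SBERT and is the multilingual version of paraphrase-MiniLM-L12-v2; it supports Chinese, English, and other languages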
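
Since every score in these tables is a Spearman coefficient (apparently reported multiplied by 100), a small sketch of how such a number can be computed may help: rank both lists, then take the Pearson correlation of the ranks. The helper names average_ranks and spearman and the toy scores are invented for illustration; the project's own evaluation script is not shown here and may differ in detail.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(predicted, gold):
    """Pearson correlation of the two rank lists (undefined if either is constant)."""
    rank_p = average_ranks(predicted)
    rank_g = average_ranks(gold)
    n = len(predicted)
    mean_p = sum(rank_p) / n
    mean_g = sum(rank_g) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(rank_p, rank_g))
    var_p = sum((p - mean_p) ** 2 for p in rank_p)
    var_g = sum((g - mean_g) ** 2 for g in rank_g)
    return cov / math.sqrt(var_p * var_g)


# Toy data: model cosine scores and gold 0-5 labels for four sentence pairs.
predicted = [0.95, 0.10, 0.40, 0.70]
gold = [5, 0, 2, 3]
print(round(spearman(predicted, gold), 4))
```

Because only ranks matter, the metric rewards a model for ordering pairs correctly even when its raw cosine values are on a different scale from the labels.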

Release Models

  • Chinese matching evaluation results for the models released by this project:

| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 3138 |

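Notes:

  • Evaluation metric: Spearman correlation coefficient
  • shibing624/text2vec-base-chinese is trained with the CoSENT method on Chinese STS-B data, starting from hfl/chinese-macbert-base, and performs well on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train it. The model files are on the HF model hub; recommended for general Chinese semantic matching
  • shibing624/text2vec-base-chinese-sentence is trained with the CoSENT method, starting from nghuyong/ernie-3.0-base-zh, on the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, and performs well on the Chinese NLI test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it. The model files are on the HF model hub; recommended for Chinese s2s (sentence vs. sentence) semantic matching
  • shibing624/text2vec-base-chinese-paraphrase is trained with the CoSENT method, starting from nghuyong/ernie-3.0-base-zh, on the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset. Compared with shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, this dataset adds s2p (sentence to paraphrase) data, which strengthens long-text representation, and the model reaches SOTA on the Chinese NLI test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it. The model files are on the HF model hub; recommended for Chinese s2p (sentence vs. paragraph) semantic matching
  • shibing624/text2vec-base-multilingual is trained with the CoSENT method, starting from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, on the manually curated multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset, and improves on the original model on the Chinese and English test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it. The model files are on the HF model hub; recommended for multilingual semantic matching
  • w2v-light-tencent-chinese is a Word2Vec model built from the Tencent word embeddings. It loads and runs on CPU and suits Chinese literal (surface-form) matching and cold-start situations with little data
  • Any of the pretrained models can be loaded through transformers, e.g. the MacBERT model: --model_name hfl/chinese-macbert-base, or the RoBERTa model: --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall
  • To assess robustness, the SOHU test sets, which were not used in training, are included to test generalization. To be useful out of the box, the models are trained on the Chinese matching datasets collected for this project; those datasets are also uploaded to HF datasets (links below)
  • Experiments on Chinese matching tasks show that the best pooling choices are EncoderType.FIRST_LAST_AVG and EncoderType.MEAN; the difference in prediction quality between the two is very small
  • To reproduce the Chinese matching results, download the Chinese matching datasets into examples/data and run tests/test_model_spearman.py
  • The GPU environment for the QPS test is a Tesla V100 with 32 GB of memory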
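
The two pooling options named above can be pictured with a toy example. The sketch below assumes that MEAN averages the last layer's token vectors over non-padding positions, and that FIRST_LAST_AVG averages such a mean taken from the first and the last hidden layer. That reading, the function names, and the toy numbers are assumptions made for illustration; this is not a copy of the library's implementation.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def first_last_avg(first_layer, last_layer, attention_mask):
    """Average the pooled first-layer and pooled last-layer vectors (assumed meaning)."""
    first = masked_mean(first_layer, attention_mask)
    last = masked_mean(last_layer, attention_mask)
    return [(a + b) / 2 for a, b in zip(first, last)]


# Three token positions, the third one is padding; hidden size 2.
mask = [1, 1, 0]
first_layer = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
last_layer = [[2.0, 2.0], [4.0, 6.0], [9.0, 9.0]]

mean_vec = masked_mean(last_layer, mask)
fla_vec = first_last_avg(first_layer, last_layer, mask)
print(mean_vec, fla_vec)
```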

Model training experiment report: Experiment Report

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or

git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install torch # conda install pytorch
pip install -r requirements.txt
pip install --no-deps .

Usage

Text vector representation

Compute text vectors with a pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> emb = m.encode("如何更换花呗绑定银行卡")
>>> print("Embedding shape:", emb.shape)
Embedding shape: (768,)

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
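    # Chinese sentence embedding model (CoSENT); recommended for Chinese semantic matching; supports further fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence embedding model (CoSENT); recommended for multilingual (including Chinese and English) semantic matching; supports further fine-tuning
    multilingual_model = SentenceModel("shibing624/text2vec-base-multilingual")
    compute_emb(multilingual_model)

    # Chinese word vector model (word2vec); suitable for Chinese literal matching and cold start
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)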

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ... 
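  • The return value embeddings is a numpy.ndarray of shape (sentences_size, model_embedding_size). Any one of the three models works; the first is recommended.
  • shibing624/text2vec-base-chinese is trained with the CoSENT method on the Chinese STS-B dataset and has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel and can be called as in the example above, or through the transformers library as shown below. The model is downloaded automatically to the local path ~/.cache/huggingface/transformers
  • w2v-light-tencent-chinese is a Word2Vec model loaded through gensim. It uses the Tencent word embeddings Tencent_AILab_ChineseEmbedding.tar.gz to compute the vector of each word or character, and the sentence vector is the average of the word vectors. The model is downloaded automatically to the local path ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin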
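
As a rough illustration of the averaging step behind the Word2Vec sentence vector, the sketch below averages toy 2-dimensional word vectors. The function name, the toy vectors, the pre-segmented token list, and the zero-vector fallback for sentences with no known word are all assumptions for illustration; the real model uses its own tokenizer and the Tencent embeddings.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Sentence vector = mean of the vectors of the tokens found in the vocabulary."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        # Assumption: fall back to a zero vector when no token is in the vocabulary.
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy vocabulary; "绑定" is deliberately missing and gets skipped.
toy_vectors = {"银行卡": [1.0, 3.0], "花呗": [3.0, 1.0]}
sentence_vec = average_word_vectors(["花呗", "绑定", "银行卡"], toy_vectors, dim=2)
print(sentence_vec)
```

Averaging ignores word order, which is one reason this kind of model suits literal matching better than deeper semantic matching.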

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)

Word2Vec word vectors

Two sets of Word2Vec word vectors are provided; choose either one:

Downstream tasks

1. Sentence similarity

example: examples/semantic_text_similarity_demo.py

import sys

sys.path.append('..')
from text2vec import Similarity

# Two lists of sentences
sentences1 = ['如何更换花呗绑定银行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['花呗更改绑定银行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

sim_model = Similarity()
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        score = sim_model.get_score(sentences1[i], sentences2[j])
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], score))

output:

如何更换花呗绑定银行卡 		 花呗更改绑定银行卡 		 Score: 0.9477
如何更换花呗绑定银行卡 		 The dog plays in the garden 		 Score: -0.1748
如何更换花呗绑定银行卡 		 A woman watches TV 		 Score: -0.0839
如何更换花呗绑定银行卡 		 The new movie is so great 		 Score: -0.0044
The cat sits outside 		 花呗更改绑定银行卡 		 Score: -0.0097
The cat sits outside 		 The dog plays in the garden 		 Score: 0.1908
The cat sits outside 		 A woman watches TV 		 Score: -0.0203
The cat sits outside 		 The new movie is so great 		 Score: 0.0302
A man is playing guitar 		 花呗更改绑定银行卡 		 Score: -0.0010
A man is playing guitar 		 The dog plays in the garden 		 Score: 0.1062
A man is playing guitar 		 A woman watches TV 		 Score: 0.0055
A man is playing guitar 		 The new movie is so great 		 Score: 0.0097
The new movie is awesome 		 花呗更改绑定银行卡 		 Score: 0.0302
The new movie is awesome 		 The dog plays in the garden 		 Score: -0.0160
The new movie is awesome 		 A woman watches TV 		 Score: 0.1321
The new movie is awesome 		 The new movie is so great 		 Score: 0.9591

The sentence cosine similarity score ranges over [-1, 1]; the larger the value, the more similar the sentences.
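
For reference, the cosine score itself is just a normalized dot product. The plain-Python sketch below shows the three boundary cases of the [-1, 1] range on toy 2-dimensional vectors; the function name cosine_similarity is illustrative and is not the library's API.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```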

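2. Text matching search

Typically used to find the texts most similar to a query within a candidate document set; common in QA question matching, similar-text retrieval, and related tasks.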

example: examples/semantic_search_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel, cos_sim, semantic_search

embedder = SentenceModel()

# Corpus with example sentences
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = [
    '如何更换花呗绑定银行卡',
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=5)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

output:

Query: 如何更换花呗绑定银行卡
Top 5 most similar sentences in corpus:
花呗更改绑定银行卡 (Score: 0.9477)
我什么时候开通了花呗 (Score: 0.3635)
A man is eating food. (Score: 0.0321)
A man is riding a horse. (Score: 0.0228)
Two men pushed carts through the woods. (Score: 0.0090)

======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)

======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)

======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)
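
Conceptually, a top-k search like the one above scores the query against every corpus embedding and keeps the best k. The sketch below does this by brute force in plain Python on toy vectors. It mirrors the 'corpus_id' and 'score' keys used in the example, but the function names are hypothetical and this is not the library's implementation of semantic_search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_search(query_vec, corpus_vecs, top_k=5):
    """Score the query against every corpus vector and return the top_k best hits."""
    hits = [
        {"corpus_id": idx, "score": cosine(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]


# Toy 2-dimensional "embeddings" standing in for real sentence vectors.
corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = top_k_search([1.0, 0.0], corpus_vecs, top_k=2)
print([hit["corpus_id"] for hit in hits])
```

Brute force is fine for small corpora; for very large ones a vector index is the usual choice.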

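Supporting libraries for downstream tasks

similarities library [recommended]

For text similarity and text matching search, the similarities library is recommended. It is compatible with the Word2vec, SBERT, and CoSENT semantic matching models released by this project, also supports literal-level similarity computation and matching search algorithms, and handles both text and images.

Install: pip install -U similarities

Sentence similarity: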

from similarities import Similarity

m = Similarity()
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

Models

CoSENT model

CoSENT (Cosine Sentence) is a text matching model: a sentence embedding scheme that improves on Sentence-BERT with CosineRankLoss.

Network structure:

Training:

Inference:

CoSENT supervised model

Train and predict with a CoSENT model:

  • Train and evaluate a CoSENT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-cosent
  • Train and evaluate a CoSENT model on ATEC, the Ant Financial matching dataset

The following Chinese matching datasets are supported: 'ATEC', 'STS-B', 'BQ', 'LCQMC', 'PAWSX'. For details, see HuggingFace datasets: https://huggingface.co/datasets/shibing624/nli_zh

python training_sup_text_matching_model.py --task_name ATEC --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/ATEC-cosent

  • Train a model on your own Chinese dataset

example: examples/training_sup_text_matching_model_mydata.py

Single-GPU training:

CUDA_VISIBLE_DEVICES=0 python training_sup_text_matching_model_mydata.py --do_train --do_predict

Multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 training_sup_text_matching_model_mydata.py --do_train --do_predict --output_dir outputs/STS-B-text2vec-macbert-v1 --batch_size 64 --bf16 --data_parallel

For the training set format, see examples/data/STS-B/STS-B.valid.data

sentence1   sentence2   label
一个女孩在给她的头发做发型。	一个女孩在梳头。	2
一群男人在海滩上踢足球。	一群男孩在海滩上踢足球。	3
一个女人在测量另一个女人的脚踝。	女人测量另一个女人的脚踝。	5

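label can be a 0/1 label, where 0 means the two sentences are not similar and 1 means they are similar; it can also be a score from 0 to 5, where a higher score means the two sentences are more similar. The model supports both.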
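
A minimal sketch of reading data in this layout, assuming each record is one line of sentence1, sentence2, and an integer label separated by tab characters. The function name parse_pairs is hypothetical, and skipping rows whose label is not an integer (such as a header row) is a choice made here, not something the project documents.

```python
def parse_pairs(lines):
    """Parse `sentence1<TAB>sentence2<TAB>label` lines into (str, str, int) tuples."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        sentence1, sentence2, label = fields
        try:
            pairs.append((sentence1, sentence2, int(label)))
        except ValueError:
            continue  # skip a header row or any non-integer label
    return pairs


sample = [
    "一个女孩在给她的头发做发型。\t一个女孩在梳头。\t2\n",
    "一群男人在海滩上踢足球。\t一群男孩在海滩上踢足球。\t3\n",
]
pairs = parse_pairs(sample)
print(len(pairs), pairs[0][2])
```

To read a file instead of a list, pass an open file object, for example `with open(path, encoding="utf-8") as f: pairs = parse_pairs(f)`.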

  • Train and evaluate a CoSENT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased  --output_dir ./outputs/STS-B-en-cosent

CoSENT unsupervised model

  • Train a CoSENT model on the English NLI dataset and evaluate it on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-cosent

Sentence-BERT model

Sentence-BERT is a text matching model: a representation-based sentence embedding scheme.

Network structure:

Training:

Inference:

SentenceBERT supervised model

  • Train and evaluate an SBERT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-sbert
  • Train and evaluate an SBERT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-sbert

SentenceBERT unsupervised model

  • Train an SBERT model on the English NLI dataset and evaluate it on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-sbert

BERT-Match model

BERT text matching model: the native BERT matching network structure, an interaction-based sentence matching model.

Network structure:

Training and inference:

The training script is the same as above: examples/training_sup_text_matching_model.py

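Model Distillation

Because models trained with text2vec can be loaded by the sentence-transformers library, its model distillation method (distillation) is reused here.

  1. Dimensionality reduction: see dimensionality_reduction.py, which applies PCA to the model's output embeddings. This reduces storage pressure on vector search databases such as milvus and can also slightly improve model quality.
  2. Model distillation: see model_distillation.py, which distills a large teacher model into a student model with fewer layers. With an acceptable trade-off in quality, this can greatly speed up prediction.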

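Model deployment

Two ways to deploy a model as a service are provided: 1) a gRPC service built on Jina [recommended]; 2) a native HTTP service built on FastAPI.

Jina service

Uses a client/server architecture to build a high-performance service, with support for Docker and cloud-native deployment, gRPC/HTTP/WebSocket, concurrent prediction with multiple models, and multi-GPU processing.

  • Install: pip install jina

  • Start the service: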

example: examples/jina_server_demo.py

from jina import Flow

port = 50001
f = Flow(port=port).add(
    uses='jinahub://Text2vecEncoder',
    uses_with={'model_name': 'shibing624/text2vec-base-chinese'}
)

with f:
    # backend server forever
    f.block()

This prediction method (executor) has been uploaded to JinaHub, which includes Docker and k8s deployment instructions.

  • Call the service:
from jina import Client
from docarray import Document, DocumentArray

port = 50001

c = Client(port=port)

data = ['如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡']
print("data:", data)
print('data embs:')
r = c.post('/', inputs=DocumentArray([Document(text='如何更换花呗绑定银行卡'), Document(text='花呗更改绑定银行卡')]))
print(r.embeddings)

For batch calls, see the example: examples/jina_client_demo.py

FastAPI service

  • Install: pip install fastapi uvicorn

  • Start the service:

example: examples/fastapi_server_demo.py

cd examples
python fastapi_server_demo.py
  • Call the service:
curl -X 'GET' \
  'http://0.0.0.0:8001/emb?q=hello' \
  -H 'accept: application/json'
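
The same request can be made from Python with only the standard library. This sketch assumes the server is reachable at http://127.0.0.1:8001 and that /emb answers with JSON, as the accept header in the curl call suggests. The helper names are hypothetical, and the shape of the returned JSON is not specified in this document, so the result is returned as parsed without further interpretation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_emb_url(base_url, query):
    """Build the /emb request URL, percent-encoding the query text."""
    return f"{base_url}/emb?{urlencode({'q': query})}"


def fetch_embedding(base_url, query, timeout=10):
    """Send the GET request and return the parsed JSON body (format assumed)."""
    with urlopen(build_emb_url(base_url, query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_emb_url("http://127.0.0.1:8001", "hello")
print(url)

# With the server from examples/fastapi_server_demo.py running:
# result = fetch_embedding("http://127.0.0.1:8001", "hello")
```

Building the URL with urlencode matters for Chinese queries, which must be percent-encoded before they are sent.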

Dataset

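  • Datasets released by this project:

| Dataset | Description | Download Link |
| --- | --- | --- |
| shibing624/nli-zh-all | Collection of Chinese semantic matching data: 8.2 million high-quality samples drawn from text inference, similarity, summarization, question answering, instruction tuning, and other tasks, converted to a matching-format dataset | https://huggingface.co/datasets/shibing624/nli-zh-all |
| shibing624/snli-zh | Chinese SNLI and MultiNLI datasets, translated from the English SNLI and MultiNLI | https://huggingface.co/datasets/shibing624/snli-zh |
| shibing624/nli_zh | Chinese semantic matching dataset combining five tasks: ATEC, BQ, LCQMC, PAWSX, and STS-B | https://huggingface.co/datasets/shibing624/nli_zh or Baidu Netdisk (extraction code: qkt6) or github |
| shibing624/sts-sohu2021 | Chinese semantic matching dataset from the 2021 Sohu Campus Text Matching Algorithm Competition | https://huggingface.co/datasets/shibing624/sts-sohu2021 |
| ATEC | Chinese ATEC dataset, the Ant Financial Q-Q pair dataset | ATEC |
| BQ | Chinese BQ (Bank Question) dataset, a banking Q-Q pair dataset | BQ |
| LCQMC | Chinese LCQMC (large-scale Chinese question matching corpus) dataset, a Q-Q pair dataset | LCQMC |
| PAWSX | Chinese PAWS (Paraphrase Adversaries from Word Scrambling) dataset, a Q-Q pair dataset | PAWSX |
| STS-B | Chinese STS-B dataset, a Chinese natural language inference dataset translated from the English STS-B | STS-B |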

Common English matching datasets:

Dataset usage example:

pip install datasets

from datasets import load_dataset

dataset = load_dataset("shibing624/nli_zh", "STS-B") # ATEC or BQ or LCQMC or PAWSX or STS-B
print(dataset)
print(dataset['test'][0])

output:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5231
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1458
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1361
    })
})
{'sentence1': '一个女孩在给她的头发做发型。', 'sentence2': '一个女孩在梳头。', 'label': 2}

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: [email protected]
  • WeChat: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

Citation

If you use text2vec in your research, please cite it in the following format:

APA:

Xu, M. Text2vec: Text to vector toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/text2vec

BibTeX:

@misc{Text2vec,
  author = {Ming Xu},
  title = {Text2vec: Text to vector toolkit},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/text2vec}},
}

License

Licensed under The Apache License 2.0, which allows free commercial use. Please include a link to text2vec and the license in your product documentation.

Contribute

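The project code is still rough. If you improve it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

  • Add corresponding unit tests in tests
  • Run all unit tests with python -m pytest -v and make sure every test passes

After that, you can submit a PR.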

