  • Stars: 133
  • Rank 264,081 (Top 6 %)
  • Language: Python
  • License: Apache License 2.0
  • Created about 6 years ago
  • Updated over 2 years ago

Repository Details

Easily generate document/paragraph/sentence vectors and calculate similarity.

Text2Vec

Blog post in Chinese

The goal of this repository is to build a tool that easily generates document/paragraph/sentence vectors, both for similarity calculation and as input to downstream machine learning models.

Requirements

  • spaCy 2.0 (with an English model downloaded and installed)
  • gensim
  • numpy

Usage of Text to Vector (text2vec)

  • Initialize: Pre-trained Doc2Vec/Word2Vec model
import text2vec
  • Input: a list of documents; doc_list is a list of documents/paragraphs/sentences.
t2v = text2vec.text2vec(doc_list)
  • Output: a list of vectors of dimension N

The transformation can be done in any of the following ways.

# Use TFIDF
docs_tfidf = t2v.get_tfidf()

# Use Latent Semantic Indexing(LSI)
docs_lsi = t2v.get_lsi()

# Use Random Projections(RP)
docs_rp = t2v.get_rp()

# Use Latent Dirichlet Allocation(LDA)
docs_lda = t2v.get_lda()

# Use Hierarchical Dirichlet Process(HDP)
docs_hdp = t2v.get_hdp()

# Use Average of Word Embeddings
docs_avgw2v = t2v.avg_wv()

# Use Weighted Word Embeddings wrt. TFIDF
docs_emb = t2v.tfidf_weighted_wv()

For a more detailed introduction to using Weighted Word Embeddings wrt. TFIDF, please read here.
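
As a rough illustration of the idea behind TF-IDF-weighted word embeddings, the sketch below averages word vectors using each word's TF-IDF score as its weight. It is not the text2vec implementation: the toy `word_vectors` dictionary, the helper names `tfidf_weights` and `weighted_doc_vector`, and the particular TF-IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    Assumed weighting: tf = count / doc length, idf = log(N / df) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * (math.log(n_docs / df[word]) + 1.0)
            for word, count in counts.items()
        })
    return weights

def weighted_doc_vector(doc_weights, word_vectors, dim):
    """Weighted average of word vectors; words without a vector are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        vec = word_vectors.get(word)
        if vec is None:
            continue
        total = [t + weight * v for t, v in zip(total, vec)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [t / weight_sum for t in total]

# Toy, hand-made 2-dimensional "embeddings" (hypothetical values).
word_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [["the", "cat"], ["the", "dog"]]

doc_vectors = [
    weighted_doc_vector(w, word_vectors, dim=2) for w in tfidf_weights(docs)
]
```

Because "the" occurs in both documents it receives a lower weight than "cat" or "dog", so each document vector is pulled toward its distinctive word.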

Usage of Similarity Calculation (simical)

For example, suppose we want to calculate the similarity/distance between the first two sentences in the docs_emb we just computed.

Note that cosine similarity is a similarity score, where 1 is most similar; it lies between 0 and 1 for vectors with non-negative components, and between -1 and 1 in general. The other measures are actually distances (the larger the value, the less similar). It is better to calculate the distance for all possible pairs and then rank them.
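
Ranking all pairs might look like the following sketch. It uses plain Python rather than text2vec, and the sample `vectors` and the helper functions are assumptions for illustration only. Note the opposite sort directions: similarities are ranked from high to low, distances from low to high.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical document vectors standing in for docs_emb.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pairs = list(combinations(range(len(vectors)), 2))

# Most similar pair first in both rankings.
by_cosine = sorted(
    pairs,
    key=lambda p: cosine_similarity(vectors[p[0]], vectors[p[1]]),
    reverse=True,
)
by_distance = sorted(
    pairs,
    key=lambda p: euclidean_distance(vectors[p[0]], vectors[p[1]]),
)
```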

# Initialize
import text2vec
sc = text2vec.simical(docs_emb[0], docs_emb[1])

# Use Cosine
simi_cos = sc.Cosine()

# Use Euclidean
simi_euc = sc.Euclidean()

# Use Triangle's Area Similarity (TS)
simi_ts = sc.Triangle()

# Use Sector's Area Similarity (SS)
simi_ss = sc.Sector()

# Use TS-SS
simi_ts_ss = sc.TS_SS()
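
To make the last three measures more concrete, here is a minimal sketch of TS, SS and TS-SS in plain Python. It follows the formulation of TS-SS as it is commonly stated (the angle between the vectors plus 10 degrees, a triangle area, and a sector area built from the Euclidean distance and the magnitude difference); whether text2vec computes exactly these values is an assumption, and the function names below are hypothetical.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _theta_degrees(a, b):
    """Angle between a and b in degrees, plus the 10-degree offset."""
    cos = sum(x * y for x, y in zip(a, b)) / (_norm(a) * _norm(b))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding error
    return math.degrees(math.acos(cos)) + 10.0

def triangle_area(a, b):
    """TS: area of the triangle spanned by a and b using the offset angle."""
    theta = math.radians(_theta_degrees(a, b))
    return _norm(a) * _norm(b) * math.sin(theta) / 2.0

def sector_area(a, b):
    """SS: sector with radius (Euclidean distance + magnitude difference)."""
    ed = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    md = abs(_norm(a) - _norm(b))
    return math.pi * (ed + md) ** 2 * _theta_degrees(a, b) / 360.0

def ts_ss(a, b):
    """TS-SS: product of the two areas; smaller values mean more similar."""
    return triangle_area(a, b) * sector_area(a, b)

a = [1.0, 0.0]
b = [0.0, 1.0]
score_same = ts_ss(a, a)  # identical vectors: the sector collapses to zero
score_diff = ts_ss(a, b)  # orthogonal vectors: strictly positive
```

All three values behave as distances, so lower scores indicate closer vectors.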

Reference

https://radimrehurek.com/gensim/tut2.html

https://github.com/sdimi/average-word2vec

https://github.com/taki0112/Vector_Similarity
