Awesome Document Similarity Measures

A curated list of resources, such as papers, tutorials, code, etc., on the topic of document similarity measures.

Please feel free to create a pull request if you would like to add other awesome papers.

Motivation

The goal of this repository is to provide a comprehensive overview for students and reseachers.
Document similarity measures are basis the several downstream applications in the area of natural language processing (NLP) and information retrieval (IR). Among the most common applications are clustering, duplicate or plagirism detection and content-based recommender systems. We selected the following content while having primarily the recommender systems application in mind. In particular, we focus on literature recommender systems that need to assess the similarity of long-form and rich content documents. "Long-form" refers to the amount of document content from +100 sentences, whereas rich content means that documents contain aside from text also images, mathematical equations and citations/links.

Dimensions of Similarity

Documents might be declared as similar, when they, e.g., cover the same topic, use a common set of words or are written in the same font. In IR the dimension of similarity defines the understanding of similarity. We distinguish between the following dimensions: lexical, structural and semantic document similarity.

Moreover, similarity is not a binary decision. In many cases declaring two things as similar or not, is not suitable. Instead, the degree of similarity is measured. Similarity measures express therefore document similarity as normalised scalar score, which is within an interval of zero to one. The highest degree of similarity is measured as one. When two objects are dissimilar, the degree of similarity is zero.

D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. ICML, pp. 296–304, 1998.

Lexical Similarity

The lexical document similarity of two documents depends on the words, which occur in the document text. A total overlap between vocabularies would result in a lexical similarity of 1, whereas 0 means both documents share no words. This dimension of similarity can be calculated by a simple word-to-word comparison. Methods like stemming or stop word removal increase the effectiveness of lexical similarity.

J. B. Lovins, “Development of a stemming algorithm,” Mech. Transl. Comput. Linguist., vol. 11, no. June, pp. 22–31, 1968.
C. Fox, “A stop list for general text,” ACM SIGIR Forum, vol. 24, no. 1–2. pp. 19–21, 1989.

Strutural Similarity

Whereas lexical similarity focuses on the vocabulary, structural similarity describes the conceptual composition of two documents. Structural document similarity stretches from graphical components, like text layout, over similarities in the composition of text segments, e.g., paragraphs or sentences that are lexical similar, to the arrangement of citations and hyperlinks.

Structural similarity is mainly used for semi-structured document formats, such as XML or HTML. A common, yet expensive in computing time, approach is to calculate the minimum cost edit distance between any two document structures. The cost edit distance is the number of actions that are required to change a document so it is identical to another document

C. Shin and D. Doermann, “Document image retrieval based on layout structural similarity,” Int. Conf. Image Process. Comput. Vision, Pattern Recognit., pp. 606–612, 2006.
D. Buttler, “A short survey of document structure similarity algorithms,” Int. Conf. Internet Comput., pp. 3–9, 2004.

Semantic Similarity

Two documents are considered as semantically similar when they cover related topics or have the same semantic meaning. A proper determination of the semantic similarity is essential for many IR systems, since in many use cases the users information need is rather about the semantic meaning than the vocabulary or structure of a document. However, measuring topical relatedness is a complex task. Therefore, lexical or structural similarity is often used to approximate semantic similarity. The example below shows that approximating semantic by lexical similarity is not always suitable.

Our earth is round.
The world is a globe.

Even if both sentences are lexically different, because they have only one out of five words in common, the semantic meaning of the sentences is synonymous.

Similarity concepts

Text similarity

Psychological

Narratives

Nguyen, D., Trieschnigg, D., & Theune, M. (2014). Using crowdsourcing to investigate perception of narrative similarity. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management, 321–330.

Document Representations

In order to compute the similarity of two documents, a numeric representation of the documents must be derived.

These representations are usually vectors of real numbers, whereby the vectors can be either sparse or dense. The terms "sparse" and "dense" refer to the number of zero vs. non-zero elements in the vectors. A sparse vector is one that contains mostly zeros and few non-zero entries. A dense vector contains mostly non-zeros.

In addition, we distinguish document representations based on the type of data that they rely on, e.g., text, topics, citations.

In the context of machine learning approaches that produce dense vector representations, the process is often refered to as learning of document features.

Traditional Text-based

Bag-of-Words
VSM
TF-IDF (IDF variants, BM25,)

Word-level

Word2Vec. Paper
Glove. Paper. Code
FastText Paper. Code

Word Context

Contextualized Word Vectors (CoVe). Paper, Code
Embeddings from Language Models (ELMo). Paper
Contextual String Embeddings (Zalando Flair). Paper

From word to sentence level

Average
Weighted Average
Smooth Inverse Frequency. A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Sentence-level

Skip-thoughts. Paper. Code
Quick-Thoughts. Paper
Universal Sentence Encoder. Paper

We find that using a similarity based on angular distance performs better on average than raw cosine similarity.

InferSent. Paper Code

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

ERCNN: Enhanced Recurrent Convolutional Neural Networks for Learning Sentence Similarity Paper. Code Code II
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations Paper Code

BERT and other Transformer Language Models

BERT Paper
GPT Paper
Generative Pre-Training-2 (GPT-2) Paper
Universal Language Model Fine-tuning (ULMFiT) Paper
XLNet
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Paper Code
Sentence Transformers
Li, B., Zhou, H., He, J., Wang, M., Yang, Y., & Li, L. (2020). On the Sentence Embeddings from Pre-trained Language Models. EMNLP 2020. (Code)

BERT pooling strategies:

Bert is amazing at encoding texts, just pool the contextualized embeddings. If not pre-training, use the 2nd to last layer as it usually gets better results (see the research in the bert-as-a-service github repo). Reddit

https://github.com/hanxiao/bert-as-service#q-what-are-the-available-pooling-strategies

Better BERT embeddings (fix anisotropic problem):

Overcoming BERT's 512 token limit:

The Transformer self-attention mechanism imposes a quadratic cost with respect to sequence length, limiting its wider application, especially for long text.

Long-form document classification with BERT. Blogpost, Code
See ICLR 2020 reviews:
- BERT-AL: BERT for Arbitrarily Long Document Understanding
- Blockwise Self-Attention for Long Document Understanding
Easy-to-use interface to fine-tuned BERT models for computing semantic similarity
Ye, Z. et al. 2019. BP-Transformer: Modelling Long-Range Context via Binary Partitioning. (2019). Paper Code
Longformer: The Long-Document Transformer Code Paper
Natural Language Recommendations: A novel research paper search engine developed entirely with embedding and transformer models.

Document-level

Doc2Vec Paper, Paper 2
Fuzzy Bag-of-Words Model for Document Representation. Paper

Topic-oriented

LSA/LSI
LDA
LDA2Vec

Citations

Classical

Bibliographic coupling. Martyn, J (1964)
Co-Citation. Small, Henry (1973)
Co-Citation Proximity Analysis. Gipp, Bela; Beel, Joeran (2006)
Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?

Citation Graph Embeddings (Dense representations)

Cite2vec: Citation-Driven Document Exploration via Word Embeddings. Paper
hyperdoc2vec: Distributed Representations of Hypertext Documents. Paper
Graph Embedding for Citation Recommendation. Paper

General graph embedding methods:

DeepWalk: Online Learning of Social Representations. Paper
LINE: Large-scale Information Network Embedding. Paper. Code
node2vec. Paper
Poincare Embeddings. Nickel and Kiela (2017) Gensim Implementation

Various implementations:

Mathematical

Hybird

Concatenate Text + Citation Embeddings: Evaluating Document Representations for Content-based Legal Literature Recommendations
Citation Prediction as Training Objective: SPECTER: Document-level Representation Learning using Citation-informed Transformers

Similarity / Distance Measures

Nearest neighbours in embedding space are considered to be similar.

Euclidean Distance
Jaccard Similarity
Jensen-Shannon distance
Cosine similarity: Cosine-similarity treats all dimensions equally.
Soft cosine
Manhatten distance = L1 norm (see also Manhattan LSTM)
Edit distance
Levenshtein Distance
Word Mover Distance Paper, Paper 2
Supervised Word Moving Distance (S-WMD)

Implementations

textdistance (Python, many algorithms)

Siamese Networks

Siamese networks (Bromley, Jane, et al. "Signature verification using a siamese time delay neural network". Advances in neural information processing systems. 1994.) are neural networks containing two or more identical subnetwork components.

It is important that not only the architecture of the subnetworks is identical, but the weights have to be shared among them as well for the network to be called "siamese". The main idea behind siamese networks is that they can learn useful data descriptors that can be further used to compare between the inputs of the respective subnetworks. Hereby, inputs can be anything from numerical data (in this case the subnetworks are usually formed by fully-connected layers), image data (with CNNs as subnetworks) or even sequential data such as sentences or time signals (with RNNs as subnetworks).

Text matching

Simple and Effective Text Matching with Richer Alignment Features (Code)

Tasks

Binary classifcation

Multi-label classification

Loss functions

(Binary) Cross Entropy
Mean Square Error
...

Concatenations

The two Siamese subnetworks encode input data into u and v. Various approaches exist to then concatenate the encoded input.

Reimers, N. and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. The 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019) (2019).

Variations

Paper	Concatenation	Comment
InferSent	`(u;v;\|u-v\|;u*v)`	no evaluation
Sentence-BERT	`(u;v;\|u-v\|)`	The most important component is the element-wise difference `\|u−v\|` ... The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are.
Universal Sentence Encoder
Pairwise Document Classification	`(u;v;\|u-v\|;u*v)`	Performaning concatenation (Paper)

MLP on top of Siamese sub networks

A matching network with multi-layer perceptron (MLP) is a standard way to mix multi-dimensions of information (Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks)

Applications

Mueller, J. and Thyagarajan, A. 2014. Siamese Recurrent Architectures for Learning Sentence Similarity. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 2012 (2014), 1386–1393. DOI:https://doi.org/10.1109/CVPR.2014.180.

Benchmarks & Datasets

Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Paper, Data
SciDocs - The Dataset Evaluation Suite for SPECTER (for classification, citation prediction, user activity, recommendation)
CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example
STSbenchmark
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications

Performance measures

Surveys

Tutorials

Training State-of-the-Art Text Embedding & Neural Search Models

malteos/awesome-document-similarity

malteos

Reviews

Repository Details