• Stars
    star
    232
  • Rank 172,847 (Top 4 %)
  • Language
  • License
    MIT License
  • Created about 5 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A curated list of resources on document similarity measures (papers, tutorials, code, ...)

Awesome Document Similarity Measures

contributing-image

A curated list of resources, such as papers, tutorials, code, etc., on the topic of document similarity measures.

Please feel free to create a pull request if you would like to add other awesome papers.

Motivation

The goal of this repository is to provide a comprehensive overview for students and reseachers.
Document similarity measures are basis the several downstream applications in the area of natural language processing (NLP) and information retrieval (IR). Among the most common applications are clustering, duplicate or plagirism detection and content-based recommender systems. We selected the following content while having primarily the recommender systems application in mind. In particular, we focus on literature recommender systems that need to assess the similarity of long-form and rich content documents. "Long-form" refers to the amount of document content from +100 sentences, whereas rich content means that documents contain aside from text also images, mathematical equations and citations/links.

Dimensions of Similarity

Documents might be declared as similar, when they, e.g., cover the same topic, use a common set of words or are written in the same font. In IR the dimension of similarity defines the understanding of similarity. We distinguish between the following dimensions: lexical, structural and semantic document similarity.

Moreover, similarity is not a binary decision. In many cases declaring two things as similar or not, is not suitable. Instead, the degree of similarity is measured. Similarity measures express therefore document similarity as normalised scalar score, which is within an interval of zero to one. The highest degree of similarity is measured as one. When two objects are dissimilar, the degree of similarity is zero.

  • D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. ICML, pp. 296–304, 1998.

Lexical Similarity

The lexical document similarity of two documents depends on the words, which occur in the document text. A total overlap between vocabularies would result in a lexical similarity of 1, whereas 0 means both documents share no words. This dimension of similarity can be calculated by a simple word-to-word comparison. Methods like stemming or stop word removal increase the effectiveness of lexical similarity.

  • J. B. Lovins, “Development of a stemming algorithm,” Mech. Transl. Comput. Linguist., vol. 11, no. June, pp. 22–31, 1968.
  • C. Fox, “A stop list for general text,” ACM SIGIR Forum, vol. 24, no. 1–2. pp. 19–21, 1989.

Strutural Similarity

Whereas lexical similarity focuses on the vocabulary, structural similarity describes the conceptual composition of two documents. Structural document similarity stretches from graphical components, like text layout, over similarities in the composition of text segments, e.g., paragraphs or sentences that are lexical similar, to the arrangement of citations and hyperlinks.

Structural similarity is mainly used for semi-structured document formats, such as XML or HTML. A common, yet expensive in computing time, approach is to calculate the minimum cost edit distance between any two document structures. The cost edit distance is the number of actions that are required to change a document so it is identical to another document

  • C. Shin and D. Doermann, “Document image retrieval based on layout structural similarity,” Int. Conf. Image Process. Comput. Vision, Pattern Recognit., pp. 606–612, 2006.
  • D. Buttler, “A short survey of document structure similarity algorithms,” Int. Conf. Internet Comput., pp. 3–9, 2004.

Semantic Similarity

Two documents are considered as semantically similar when they cover related topics or have the same semantic meaning. A proper determination of the semantic similarity is essential for many IR systems, since in many use cases the users information need is rather about the semantic meaning than the vocabulary or structure of a document. However, measuring topical relatedness is a complex task. Therefore, lexical or structural similarity is often used to approximate semantic similarity. The example below shows that approximating semantic by lexical similarity is not always suitable.

  1. Our earth is round.
  2. The world is a globe.

Even if both sentences are lexically different, because they have only one out of five words in common, the semantic meaning of the sentences is synonymous.

Similarity concepts

Text similarity

Psychological

Narratives

Document Representations

In order to compute the similarity of two documents, a numeric representation of the documents must be derived.

These representations are usually vectors of real numbers, whereby the vectors can be either sparse or dense. The terms "sparse" and "dense" refer to the number of zero vs. non-zero elements in the vectors. A sparse vector is one that contains mostly zeros and few non-zero entries. A dense vector contains mostly non-zeros.

In addition, we distinguish document representations based on the type of data that they rely on, e.g., text, topics, citations.

In the context of machine learning approaches that produce dense vector representations, the process is often refered to as learning of document features.

Traditional Text-based

  • Bag-of-Words
  • VSM
  • TF-IDF (IDF variants, BM25,)

Word-level

Word Context

  • Contextualized Word Vectors (CoVe). Paper, Code
  • Embeddings from Language Models (ELMo). Paper
  • Contextual String Embeddings (Zalando Flair). Paper

From word to sentence level

Sentence-level

We find that using a similarity based on angular distance performs better on average than raw cosine similarity.

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

  • ERCNN: Enhanced Recurrent Convolutional Neural Networks for Learning Sentence Similarity Paper. Code Code II

  • DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations Paper Code

BERT and other Transformer Language Models

BERT pooling strategies:

Bert is amazing at encoding texts, just pool the contextualized embeddings. If not pre-training, use the 2nd to last layer as it usually gets better results (see the research in the bert-as-a-service github repo). Reddit

Better BERT embeddings (fix anisotropic problem):

Overcoming BERT's 512 token limit:

The Transformer self-attention mechanism imposes a quadratic cost with respect to sequence length, limiting its wider application, especially for long text.

Document-level

Topic-oriented

  • LSA/LSI
  • LDA
  • LDA2Vec

Citations

Classical

Citation Graph Embeddings (Dense representations)

  • Cite2vec: Citation-Driven Document Exploration via Word Embeddings. Paper

  • hyperdoc2vec: Distributed Representations of Hypertext Documents. Paper

  • Graph Embedding for Citation Recommendation. Paper

General graph embedding methods:

Various implementations:

Mathematical

Hybird

Similarity / Distance Measures

Nearest neighbours in embedding space are considered to be similar.

  • Euclidean Distance

  • Jaccard Similarity

  • Jensen-Shannon distance

  • Cosine similarity: Cosine-similarity treats all dimensions equally.

  • Soft cosine

  • Manhatten distance = L1 norm (see also Manhattan LSTM)

  • Edit distance

  • Levenshtein Distance

  • Word Mover Distance Paper, Paper 2

  • Supervised Word Moving Distance (S-WMD)

Implementations

Siamese Networks

Siamese networks (Bromley, Jane, et al. "Signature verification using a siamese time delay neural network". Advances in neural information processing systems. 1994.) are neural networks containing two or more identical subnetwork components.

It is important that not only the architecture of the subnetworks is identical, but the weights have to be shared among them as well for the network to be called "siamese". The main idea behind siamese networks is that they can learn useful data descriptors that can be further used to compare between the inputs of the respective subnetworks. Hereby, inputs can be anything from numerical data (in this case the subnetworks are usually formed by fully-connected layers), image data (with CNNs as subnetworks) or even sequential data such as sentences or time signals (with RNNs as subnetworks).

Text matching

Tasks

Binary classifcation

Multi-label classification

Loss functions

  • (Binary) Cross Entropy
  • Mean Square Error
  • ...

Concatenations

The two Siamese subnetworks encode input data into u and v. Various approaches exist to then concatenate the encoded input.

  • Reimers, N. and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. The 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019) (2019).

Variations

Paper Concatenation  Comment
InferSent (u;v;|u-v|;u*v) no evaluation
Sentence-BERT (u;v;|u-v|) The most important component is the element-wise difference |u−v| ... The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are.
Universal Sentence Encoder
Pairwise Document Classification (u;v;|u-v|;u*v) Performaning concatenation (Paper)

MLP on top of Siamese sub networks

A matching network with multi-layer perceptron (MLP) is a standard way to mix multi-dimensions of information (Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks)

Applications

  • Mueller, J. and Thyagarajan, A. 2014. Siamese Recurrent Architectures for Learning Sentence Similarity. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 2012 (2014), 1386–1393. DOI:https://doi.org/10.1109/CVPR.2014.180.

Benchmarks & Datasets

Performance measures

Surveys

Tutorials

See also

Awesome:

More Repositories

1

pytorch-bert-document-classification

Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)
Jupyter Notebook
158
star
2

scincl

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)
Python
63
star
3

aspect-document-similarity

Implementation, trained models and result data for the paper "Aspect-based Document Similarity for Research Papers" #COLING2020
Jupyter Notebook
62
star
4

llm-datasets

A collection of datasets for language model pretraining including scripts for downloading, preprocesssing, and sampling.
Python
51
star
5

semantic-document-relations

Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Python
32
star
6

legal-document-similarity

Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal Literature Recommendations"
Jupyter Notebook
31
star
7

clp-transfer

Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Python
29
star
8

aspect-document-embeddings

Code, dataset & models for the paper Specialized Document Embeddings for Aspect-based Similarity of Research Papers (#JCDL2022)
Jupyter Notebook
11
star
9

german-language-models

A collection of German GPT language models
10
star
10

awesome-contrastive-learning-for-nlp

A collection of papers about contrastive learning for natural language processing.
7
star
11

wikipedia-article-recommendations

Survey data and Python code for the ICADL 2021 paper "A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles"
Jupyter Notebook
5
star
12

getting-started

Dockerfile
3
star
13

covid-vaccination-appointment

Python
3
star
14

Leaflet.Sim

Leaflet.Sim is a framework for location-based simulations with Leaflet maps that can visualise moving markers, which can change their style, and events over time on a map.
JavaScript
2
star
15

emnlp2022-papers

Python
2
star
16

finetune-evaluation-harness

Python
2
star
17

CmdLineSlideShow

Command line script for generating rich slide shows from a set of images with transition effects and audio. Using ImageMagick and FFMPEG.
Shell
2
star
18

Wikipedia2Lucene

Import a Wikipedia XML Dump from HDFS to Lucene index or Elasticsearch and retrieve similar Wikipedia articles based on Lucene's MoreLikeThis query.
Java
1
star
19

kibana-reallybettermap

Multiple locations for Kibana's bettermap panel
JavaScript
1
star
20

news-visualization

News visualization with Elastic Search and Kibana including NER, Sentiment Analysis and Geo Locations.
Java
1
star
21

data-sourcing

Python
1
star
22

turkish-lm-bias

Investigating Gender Bias in Turkish Language Models
Jupyter Notebook
1
star