awesome-metric-learning
😎 Awesome list about practical Metric Learning and its applications
Motivation 🤓
At Qdrant, we have one goal: make metric learning more practical. This listing is in line with this purpose, and we aim at providing a concise yet useful list of awesomeness around metric learning. It is intended to be inspirational for productivity rather than serve as a full bibliography.
If you find it useful or like it in some other way, you may want to join our Discord server, where we are running a paper reading club on metric learning.
Contributing 🤩
If you want to contribute to this project, but don't know how, you may want to check out the contributing guide. It's easy! 😌
Surveys 📖
What is Metric Learning? - A beginner-friendly starting point for traditional metric learning methods from the scikit-learn website.
It is followed by guides for supervised, weakly supervised and unsupervised metric learning algorithms in the metric_learn package.
Deep Metric Learning: A Survey - A comprehensive study for newcomers.
Factors such as sampling strategies, distance metrics, and network structures are systematically analyzed by comparing the quantitative results of the methods.
Deep Metric Learning: A (Long) Survey - An intuitive survey of the state-of-the-art.
It discusses the need for metric learning, old and state-of-the-art approaches, and some real-world use cases.
A Tutorial on Distance Metric Learning: Mathematical Foundations, Algorithms, Experimental Analysis, Prospects and Challenges (with Appendices on Mathematical Background and Detailed Algorithms Explanation) - Intended for those interested in mathematical foundations of metric learning.
Neural Approaches to Conversational Information Retrieval - A working draft of a 150-page survey book by Microsoft researchers
Applications 🎮
CLIP - Training a unified vector embedding for image and text. NLP CV
CLIP offers state-of-the-art zero-shot image classification and image retrieval with a natural language query. See demo.
Wav2CLIP - Encoding audio into the same vector space as CLIP. Audio
This work achieves zero-shot classification and cross-modal audio retrieval from natural language queries.
Detic - Code released for "Detecting Twenty-thousand Classes using Image-level Supervision". CV
It is an open-class object detector to detect any label encoded by CLIP without finetuning. See demo.
GTR - Collection of Generalizable T5-based dense Retrievers (GTR) models. NLP
TensorFlow Hub offers a collection of pretrained models from the paper Large Dual Encoders Are Generalizable Retrievers. GTR models are first initialized from a pre-trained T5 checkpoint. They are then further pre-trained with a set of community question-answer pairs. Finally, they are fine-tuned on the MS Marco dataset. The two encoders are shared so the GTR model functions as a single text encoder. The input is variable-length English text and the output is a 768-dimensional vector.
TARS - Task-aware representation of sentences, a novel method for several zero-shot tasks including NER. NLP
The method and pretrained models found in Flair go beyond zero-shot sequence classification and offer zero-shot span tagging abilities for tasks such as named entity recognition and part-of-speech tagging.
BERTopic - A novel topic modeling toolkit with BERT embeddings. NLP
It leverages HuggingFace Transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It supports guided, (semi-)supervised, and dynamic topic modeling, with beautiful visualizations.
XRD Identifier - Fingerprinting substances with metric learning
Identification of substances based on spectral analysis plays a vital role in forensic science. Similarly, the material identification process is of paramount importance for malfunction reasoning in manufacturing sectors and materials research. This model identifies materials with deep metric learning applied to X-ray diffraction (XRD) spectra. Read this post for more background.
Semantic Code Search - Retrieving relevant code snippets given a natural language query. NLP
Unlike typical information retrieval tasks, code search requires bridging the semantic gap between programming language and natural language to better describe intrinsic concepts and semantics. The repository provides the pretrained models and source code for Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus, where the authors apply several tricks to achieve this.
DUCH: Deep Unsupervised Contrastive Hashing - Large-scale cross-modal text-image retrieval in remote sensing with computer vision. CV NLP
DURation: Deep Unified Representation for Heterogeneous Recommendation - Recommending different types of items efficiently. RecSys
State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity problems, because it is quite challenging to represent items with different feature spaces jointly. To tackle this problem, the authors propose a kernel-based neural network, namely deep unified representation (DURation) for heterogeneous recommendation, to jointly model unified representations of heterogeneous items while preserving the topology of their original feature spaces. See paper.
Item2Vec - Word2Vec-inspired model for item recommendation. RecSys
It provides the implementation of Item2Vec: Neural Item Embedding for Collaborative Filtering, wrapped as a sklearn estimator compatible with GridSearchCV and BayesSearchCV for hyperparameter tuning.
Earworm - Search for royalty-free commercial-use music by sonic similarity
You can search for the overall closest fit, or choose to focus on matching genre, mood, or instrumentation.
DensePhrases - A text retrieval model that can return phrases, sentences, passages, or documents for your natural language queries. NLP
It searches for phrase-level answers to your questions in real time or retrieves passages for downstream tasks. Check out the demo, or see the paper.
Alt-ZSC - An alternate implementation for zero-shot text classification. NLP
Instead of leveraging NLI/XNLI, they make use of the text encoder of the CLIP model, concluding from casual experiments that this sometimes gives better accuracy than NLI-based models.
CLMR - Contrastive learning of musical representations
Application of the SimCLR method to musical data with out-of-domain generalization in million-scale music classification. See demo or paper.
Case Studies ✍️
Libraries 🧰
Quaterion - Blazing fast framework for fine-tuning similarity learning models
Quaterion is a framework for fine-tuning similarity learning models. The framework solves the "last mile" problem in training models for semantic search, recommendations, anomaly detection, extreme classification, matching engines, etc. It is designed to combine the performance of pre-trained models with specialization for the custom task while avoiding slow and costly training.
sentence-transformers - A library for sentence-level embeddings. NLP
Developed on top of the well-known Transformers library, it provides an easy way to finetune Transformer-based models to obtain sequence-level embeddings.
OpenMetricLearning - PyTorch-based framework to train and validate the models producing high-quality embeddings. CV
MatchZoo - a collection of deep learning models for matching documents. NLP
The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase identification.
pytorch-metric-learning - A modular library implementing losses, miners, samplers and trainers in PyTorch.
tensorflow-similarity - A metric learning library in TensorFlow with a Keras-like API.
It provides support for self-supervised contrastive learning and state-of-the-art methods such as SimCLR, SimSiam, and Barlow Twins.
sense2vec - Contextually keyed word vectors. NLP
A PyTorch library to train and run inference with contextually keyed word vectors, augmented with part-of-speech tags, to support multi-word queries.
lightly - A Python library for self-supervised learning on images. CV
A PyTorch library to efficiently train self-supervised computer vision models with state-of-the-art techniques such as SimCLR, SimSiam, Barlow Twins, and BYOL, among others.
MTEB - Massive Text Embedding Benchmark. NLP
A library that helps you benchmark pretrained and custom embedding models on tens of datasets and tasks with ease.
LightFM - A Python implementation of a number of popular recommender algorithms. RecSys
It supports incorporating user and item features into traditional matrix factorization. It represents users and items as a sum of the latent representations of their features, thus achieving better generalization.
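The "sum of latent representations" idea can be illustrated with a small sketch. This is not LightFM's API: the function names, the feature names, and the embedding tables are hypothetical, and bias terms are left out for brevity.

```python
def sum_of_feature_embeddings(feature_names, embedding_table):
    """Represent a user or an item as the sum of its feature embeddings."""
    dim = len(next(iter(embedding_table.values())))
    total = [0.0] * dim
    for name in feature_names:
        for k, value in enumerate(embedding_table[name]):
            total[k] += value
    return total


def score(user_features, item_features, user_table, item_table):
    """Dot product of the summed user and item representations."""
    user_vec = sum_of_feature_embeddings(user_features, user_table)
    item_vec = sum_of_feature_embeddings(item_features, item_table)
    return sum(u * i for u, i in zip(user_vec, item_vec))
```

Because the representation depends only on features, a user who has never interacted with anything can still be scored as long as their features appear in the table.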
gensim - Library for topic modelling, document indexing and similarity retrieval with large corpora
It provides efficient multicore and memory-independent implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec.
DaisyRec - A library for recommender system development in PyTorch. RecSys
It provides implementations of algorithms such as KNN, LFM, SLIM, NeuMF, FM, DeepFM, VAE and so on, in order to ensure fair comparison of recommender system benchmarks.
Tools ⚒️
Embedding Projector - A web-based tool to visualize high-dimensional data.
It supports UMAP, t-SNE, PCA, or custom techniques to analyze embeddings of encoders.
Parallax - a tool for visualizing embeddings
It allows you to visualize the embedding space by explicitly selecting the axes through algebraic formulas on the embeddings (like king-man+woman) and to highlight specific items in the embedding space. It also supports implicit axes via PCA and t-SNE. See paper.
Processing Text Data - An optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow). NLP
Approximate Nearest Neighbors ⚡
ANN Benchmarks - Benchmarking various ANN implementations for different metrics.
It provides benchmarking of 20+ ANN algorithms on nine standard datasets, with support for bringing your own dataset. (Medium Post)
FAISS - Efficient similarity search and clustering of dense vectors that possibly do not fit in RAM
It is not the fastest ANN algorithm but achieves memory efficiency thanks to various quantization and indexing methods such as IVF, PQ, and IVF-PQ. (Tutorial)
HNSW - Hierarchical Navigable Small World graphs
It is still one of the fastest ANN algorithms out there, although it requires relatively high memory usage. (Paper: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs)
Google's ScaNN - The technology behind vector search at Google
Paper: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Papers 🔬
Dimensionality Reduction by Learning an Invariant Mapping - First appearance of Contrastive Loss.
Published by Yann LeCun et al. (2005), its main focus was on dimensionality reduction. However, the proposed method has excellent properties for metric learning, such as preserving neighbourhood relationships and generalizing to unseen data, and it has been applied extensively, with a great number of variations, ever since. We recommend reading this great post to better understand its importance for metric learning.
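A minimal sketch of a pairwise contrastive loss, assuming Euclidean distance and a hypothetical default margin of 1.0. It illustrates the idea and is not the paper's reference code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, is_similar, margin=1.0):
    """Pairwise contrastive loss.

    Similar pairs are penalized by their squared distance. Dissimilar
    pairs are penalized only while they are closer than `margin`.
    """
    d = euclidean(a, b)
    if is_similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

The margin keeps dissimilar pairs from being pushed apart indefinitely: once they are farther apart than the margin, they contribute nothing to the loss.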
FaceNet: A Unified Embedding for Face Recognition and Clustering - First appearance of Triplet Loss.
The paper introduces Triplet Loss, which can be seen as the "ImageNet moment" for deep metric learning. It is still one of the state-of-the-art methods and has a great number of applications in almost any data modality.
In Defense of the Triplet Loss for Person Re-Identification - It shows that triplet sampling matters and proposes to use batch-hard samples.
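Batch-hard sampling can be sketched as follows: for every anchor in a batch, take the farthest same-label sample and the closest different-label sample, then apply the triplet hinge. The function names and the default margin are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Mean triplet loss using the hardest positive and negative per anchor."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, other)
            for j, (other, other_label) in enumerate(zip(embeddings, labels))
            if other_label == label and j != i
        ]
        negatives = [
            euclidean(anchor, other)
            for other, other_label in zip(embeddings, labels)
            if other_label != label
        ]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        # Hardest positive is the farthest one; hardest negative is the closest one.
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```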
Deep Metric Learning with Angular Loss - A novel loss function with better properties.
It provides scale invariance, robustness against feature variance, and better convergence than Contrastive and Triplet Loss.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition - Supervised metric learning without pairs or triplets.
Although it was originally designed for the face recognition task, this loss function achieves state-of-the-art results in many other metric learning problems with simpler and faster data feeding. It is also robust against unclean and unbalanced data when modified with sub-centers and a dynamic margin.
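The additive angular margin can be sketched as below: embeddings and class weights are L2-normalized, the margin is added to the angle of the target class only, and the result feeds a standard softmax cross-entropy. The function names are hypothetical, the default scale and margin are illustrative values, and the handling of angles that exceed pi after adding the margin is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, scale=64.0, margin=0.5):
    """Scaled cosine logits with an angular margin added to the target class."""
    e = normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if k == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]
```

Since each sample is compared with class weights rather than with other samples, no pair or triplet mining is needed.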
Learning Distance Metrics from Probabilistic Information - Working with datasets that contain probabilistic labels instead of deterministic values.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning - Better regularization for high-dimensional embeddings.
The paper introduces a method that explicitly avoids the collapse problem in high dimensions with a simple regularization term on the variance of the embeddings along each dimension individually. This new term can be incorporated into other methods to stabilize training and improve performance.
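A sketch of such a variance term: a hinge on the standard deviation of each embedding dimension across the batch. The target value `gamma`, the small `eps`, and the use of the sample (n - 1) variance are assumptions of this sketch, and the batch must contain at least two rows.

```python
import math


def variance_term(batch, gamma=1.0, eps=1e-4):
    """Mean hinge penalty on the per-dimension standard deviation of a batch."""
    n, d = len(batch), len(batch[0])
    total = 0.0
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        # Penalize dimensions whose standard deviation falls below gamma.
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d
```

A collapsed batch, in which every embedding is identical, receives close to the maximum penalty, while a well-spread batch receives none.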
On the Unreasonable Effectiveness of Centroids in Image Retrieval - Higher robustness against outliers with better efficiency.
The paper proposes using the mean centroid representation during training and retrieval for robustness against outliers and more stable features. It further reduces retrieval time and storage requirements, making it suitable for production deployments.
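The retrieval side of this idea can be sketched with plain Python: each class is reduced to the mean of its embeddings, and a query is matched against centroids instead of against every gallery item. The function names are hypothetical and the training-time use of centroids is not shown.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Average the embeddings of each class into a single centroid."""
    groups = defaultdict(list)
    for embedding, label in zip(embeddings, labels):
        groups[label].append(embedding)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }


def nearest_centroid(query, centroids):
    """Return the label whose centroid is closest to the query."""
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))
```

The search space shrinks from one vector per item to one vector per class, which is where the savings in retrieval time and storage come from.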
TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning - A SOTA method to learn domain-specific sentence-level embeddings from unlabelled data.
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations - Self-Supervised method comparing two differently augmented versions of the same image with Contrastive Loss. CV
It demonstrates among other things that
- composition of data augmentations plays a critical role - Random Crop + Random Color distortion provides the best downstream classifier accuracy,
- introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations,
- and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
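The contrastive objective used by SimCLR (NT-Xent) can be sketched in plain Python. Here `views_a[i]` and `views_b[i]` are assumed to be embeddings of two augmentations of the same image, and the default temperature is an illustrative value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(views_a, views_b, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over 2N augmented views."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - positive
    return total / (2 * n)
```

Every other view in the batch acts as a negative, which is one reason larger batches help.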
SimCSE: Simple Contrastive Learning of Sentence Embeddings - An unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. NLP
The authors also incorporate annotated pairs from natural language inference datasets into their contrastive learning framework in a supervised setting, showing that the contrastive learning objective regularizes the anisotropic space of pre-trained embeddings to be more uniform, and that it better aligns positive pairs when supervised signals are available.
Learning Transferable Visual Models From Natural Language Supervision - The paper that introduced CLIP: Training a unified vector embedding for image and text. NLP CV
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision - Google's answer to CLIP: Training a unified vector embedding for image and text but using noisy text instead of a carefully curated dataset. NLP CV
Cross-Batch Memory for Embedding Learning (XBM) - A technique that aims to extend batch sizes for similarity losses, without actually evaluating all embeddings in a single batch.
Mining informative negative instances is of central importance to deep metric learning (DML). However, this task is intrinsically limited by mini-batch training, where only a mini-batch of instances is accessible at each iteration. The authors identify a "slow drift" phenomenon by observing that the embedding features drift exceptionally slowly even as the model parameters are updated throughout the training process. This suggests that the features of instances computed at preceding iterations can closely approximate the features extracted by the current model.
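The memory idea can be sketched as a fixed-size first-in, first-out queue of past embeddings and labels that the current batch mines pairs from. The class and method names are hypothetical, and in a real training loop the stored embeddings would be detached from the computation graph.

```python
from collections import deque


class CrossBatchMemory:
    """FIFO memory of embeddings computed in earlier iterations."""

    def __init__(self, capacity):
        # When the queue is full, the oldest entries are dropped automatically.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, embeddings, labels):
        """Store the embeddings and labels of the current batch."""
        for embedding, label in zip(embeddings, labels):
            self.queue.append((list(embedding), label))

    def positives_for(self, label):
        """Stored embeddings that share the given label."""
        return [e for e, stored in self.queue if stored == label]

    def negatives_for(self, label):
        """Stored embeddings with a different label."""
        return [e for e, stored in self.queue if stored != label]
```

Because features drift slowly, the stored embeddings stay close enough to what the current model would produce to serve as extra positives and negatives for any pair-based loss.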
Datasets ℹ️
Practitioners can use any labeled or unlabelled data for metric learning, provided they choose an appropriate method. However, some datasets are particularly important in the literature for benchmarking or other purposes, and we list them in this section.
SNLI - The Stanford Natural Language Inference Corpus, serving as a useful benchmark. NLP
The dataset contains pairs of sentences labeled as contradiction, entailment, and neutral regarding semantic relationships. Useful to train semantic search models in metric learning.
MultiNLI - NLI corpus with samples from multiple genres. NLP
Modeled on the SNLI corpus, the dataset contains sentence pairs from various genres of spoken and written text, and it also offers a distinctive cross-genre generalization evaluation.
Google Landmark Recognition 2019 - Label famous (and not so famous) landmarks from images. CV
Shared as a part of a Kaggle competition by Google, this dataset is more diverse and thus more interesting than the first version.
Fashion-MNIST - a dataset of Zalando's article images. CV
The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
The Stanford Online Products dataset - A dataset with 22,634 classes and 120,053 product images. CV
The dataset is published along with the "Deep Metric Learning via Lifted Structured Feature Embedding" paper.
MetaAI's 2021 Image Similarity Dataset and Challenge - A dataset with a 1M reference image set, a 1M training image set, a 50K dev query image set and a 50K test query image set. CV
The dataset is published along with the "The 2021 Image Similarity Dataset and Challenge" paper.