awesome-sentence-embedding
A curated list of pretrained sentence and word embedding models
Table of Contents
- About This Repo
- General Framework
- Word Embeddings
- OOV Handling
- Contextualized Word Embeddings
- Pooling Methods
- Encoders
- Evaluation
- Misc
- Vector Mapping
- Articles
About This Repo
- There are some awesome lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
- This repo will also be incomplete, but I'll try my best to find and include all the papers with pretrained models.
- This is not a typical awesome list because it has tables, but I guess that's OK and much better than just a huge list.
- If you find any mistakes, another paper, or anything else worth adding, please send a pull request and help me keep this list up to date.
- Enjoy!
General Framework
- Almost all the sentence embeddings work like this:
- Given some sort of word embeddings and an optional encoder (for example an LSTM) they obtain the contextualized word embeddings.
- Then they define some sort of pooling (it can be as simple as last pooling).
- The pooled vector is then either used directly for a supervised classification task (like InferSent) or used to generate a target sequence (like Skip-Thought).
- So, in general, there are many sentence embeddings that you have never heard of: simply mean-pooling over any word embedding already gives you a sentence embedding!
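The steps above can be sketched in a few lines. This is only a minimal illustration, not the method of any particular paper: the toy table `TOY_TABLE`, the stand-in `identity_encoder`, and the function name `embed_sentence` are all made up for the example, and unknown words are simply dropped.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table; a real one would come from
# a pretrained model.
TOY_TABLE = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.9]),
}


def identity_encoder(vectors):
    """Stand-in for the optional encoder (for example an LSTM)."""
    return vectors


def embed_sentence(tokens, table, encoder=identity_encoder):
    """Word embeddings -> optional encoder -> mean-pooling."""
    # 1. Look up word embeddings, dropping words missing from the table.
    #    Assumes at least one token is known; np.stack fails otherwise.
    vectors = np.stack([table[t] for t in tokens if t in table])
    # 2. Optionally contextualize them.
    hidden = encoder(vectors)
    # 3. Pool over the token axis to get one fixed-size vector.
    return hidden.mean(axis=0)


sentence_vector = embed_sentence(["the", "cat", "sat", "quietly"], TOY_TABLE)
```

Swapping `identity_encoder` for a trained encoder, or the mean for another pooling function, gives the other variants of the same framework.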
Word Embeddings
- Note: don't worry about the language of the code. Except for the subword models, you can almost always just load the pretrained embedding table in the framework of your choice and ignore the training code.
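As a rough sketch of what "just use the pretrained embedding table" can look like, the snippet below parses the plain-text layout that many releases use (one `word v1 v2 ... vd` entry per line, sometimes preceded by a `count dim` header line). That layout is an assumption about the file you downloaded; `load_text_embeddings` is a hypothetical helper, and binary or subword formats need their own loaders.

```python
import io

import numpy as np


def load_text_embeddings(handle):
    """Parse `word v1 v2 ... vd` lines into (vocab, matrix)."""
    words, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        # Skip blank lines and a two-field "count dim" header line.
        # (Assumes vectors have at least two dimensions.)
        if len(parts) < 3:
            continue
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    vocab = {word: index for index, word in enumerate(words)}
    return vocab, np.asarray(rows, dtype=np.float32)


# In-memory stand-in for a downloaded embedding file.
demo_file = io.StringIO("2 3\ncat 0.7 -0.1 0.4\ndog 0.6 0.0 0.5\n")
vocab, matrix = load_text_embeddings(demo_file)
```

The resulting `matrix` can be handed to whatever framework you use as the initial weights of an embedding layer, with `vocab` mapping words to row indices.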
OOV Handling
- Drop OOV words!
- One OOV vector (unk vector)
- Use subword models (n-gram, BPE, char)
- ALaCarte: A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
- Mimick: Mimicking Word Embeddings using Subword RNNs
- CompactReconstruction: Subword-based Compact Reconstruction of Word Embeddings
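The first three options can be sketched as follows. The function names and the tiny tables are invented for illustration, and the subword variant is a simplified character n-gram average (with `<` and `>` as assumed boundary markers), not the exact procedure of any listed paper.

```python
import numpy as np


def lookup_drop(tokens, table):
    """Option 1: drop out-of-vocabulary words."""
    return [table[t] for t in tokens if t in table]


def lookup_unk(tokens, table, unk_vector):
    """Option 2: map every OOV word to a single shared unk vector."""
    return [table.get(t, unk_vector) for t in tokens]


def char_ngrams(word, n=3):
    """Character n-grams of the word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def lookup_subword(word, ngram_table, dim):
    """Option 3: build a word vector by averaging known n-gram vectors."""
    vectors = [ngram_table[g] for g in char_ngrams(word) if g in ngram_table]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


# Hypothetical toy tables.
word_table = {"cat": np.ones(2)}
ngram_table = {"<ca": np.array([1.0, 0.0]), "at>": np.array([0.0, 1.0])}

dropped = lookup_drop(["cat", "dax"], word_table)
with_unk = lookup_unk(["cat", "dax"], word_table, np.zeros(2))
from_ngrams = lookup_subword("cat", ngram_table, dim=2)
```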
Contextualized Word Embeddings
- Note: all the unofficial models can load the official pretrained models
Pooling Methods
- {Last, Mean, Max}-Pooling
- Special Token Pooling (like BERT and OpenAI's Transformer)
- SIF: A Simple but Tough-to-Beat Baseline for Sentence Embeddings
- TF-IDF: Unsupervised Sentence Representations as Word Information Series: Revisiting TF-IDF
- P-norm: Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations
- DisC: A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs
- GEM: Zero-Training Sentence Embedding via Orthogonal Basis
- SWEM: Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
- VLAWE: Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation
- Efficient Sentence Embedding using Discrete Cosine Transform
- fse: Gensim add-on for fast sentence embeddings. Supports Mean, Max, SIF, uSIF
- Efficient Sentence Embedding via Semantic Subspace Analysis
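A few of the simpler pooling schemes from this list, sketched with NumPy over a matrix `H` of shape `(num_tokens, dim)`. The function names are made up for the example. `concat_power_means` only concatenates the min, mean, and max (the power means for p = -inf, 1, +inf), and `sif_weighted_mean` covers only the SIF weighting step a / (a + p(w)), leaving out the common-component removal, so neither should be read as a full reimplementation of the corresponding paper.

```python
import numpy as np


def last_pool(H):
    return H[-1]


def mean_pool(H):
    return H.mean(axis=0)


def max_pool(H):
    return H.max(axis=0)


def concat_power_means(H):
    """Concatenate the p = -inf (min), p = 1 (mean), p = +inf (max) means."""
    return np.concatenate([H.min(axis=0), H.mean(axis=0), H.max(axis=0)])


def sif_weighted_mean(H, word_probs, a=1e-3):
    """Average of token vectors weighted by a / (a + p(w)).

    `word_probs` holds an assumed unigram probability for each token.
    """
    weights = a / (a + np.asarray(word_probs, dtype=float))
    return (weights[:, None] * H).sum(axis=0) / len(H)


# Two tokens, two dimensions.
H = np.array([[1.0, 4.0], [3.0, 2.0]])
pooled = {
    "last": last_pool(H),
    "mean": mean_pool(H),
    "max": max_pool(H),
    "p-mean": concat_power_means(H),
}
```

Frequent words get a high p(w) and therefore a small weight, which is the point of the SIF weighting; with all probabilities at zero it reduces to plain mean-pooling.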
Encoders
Evaluation
- decaNLP: The Natural Language Decathlon: Multitask Learning as Question Answering
- SentEval: SentEval: An Evaluation Toolkit for Universal Sentence Representations
- GLUE: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Exploring Semantic Properties of Sentence Embeddings
- Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
- Word Embeddings Benchmarks: How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks
- MLDoc: A Corpus for Multilingual Document Classification in Eight Languages
- LexNET: Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
- wordvectors.net: Community Evaluation and Exchange of Word Vectors at wordvectors.org
- jiant: Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling
- jiant: What do you learn from context? Probing for sentence structure in contextualized word representations
- Evaluation of sentence embeddings in downstream and linguistic probing tasks
- QVEC: Evaluation of Word Vector Representations by Subspace Alignment
- Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments
- EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference
- Evaluating Word Embedding Models: Methods and Experimental Results
- How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
- Linguistic Knowledge and Transferability of Contextual Representations: contextual-repr-analysis
- LINSPECTOR: Multilingual Probing Tasks for Word Representations
- Pitfalls in the Evaluation of Sentence Embeddings
- Probing Multilingual Sentence Representations With X-Probe: xprobe
Misc
- Word Embedding Dimensionality Selection: On the Dimensionality of Word Embedding
- Half-Size: Simple and Effective Dimensionality Reduction for Word Embeddings
- magnitude: Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package
- To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
- Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors: fuzzymax
- The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning: EmbeddingDistillation
- Improving Distributional Similarity with Lessons Learned from Word Embeddings: hyperwords
- Misspelling Oblivious Word Embeddings: moe
- Single Training Dimension Selection for Word Embedding with PCA
- Compressing Word Embeddings via Deep Compositional Code Learning: neuralcompressor
- UER: An Open-Source Toolkit for Pre-training Models: UER-py
- Situating Sentence Embedders with Nearest Neighbor Overlap
- German BERT
Vector Mapping
- Cross-lingual Word Vectors Projection Using CCA: Improving Vector Space Word Representations Using Multilingual Correlation
- vecmap: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
- MUSE: Unsupervised Machine Translation Using Monolingual Corpora Only
- CrossLingualELMo: Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
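For intuition, one standard building block for mapping one embedding space onto another is orthogonal Procrustes: given row-aligned matrices `X` (source vectors) and `Y` (target vectors) for a seed dictionary, find the orthogonal `W` that minimizes the Frobenius norm of `X @ W - Y`. The sketch below is a generic illustration under that assumption, not the specific method of any entry above, and `procrustes_map` is a hypothetical name.

```python
import numpy as np


def procrustes_map(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F, via the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check: the "target" space is a rotated copy of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
angle = 0.3
rotation = np.array([
    [np.cos(angle), -np.sin(angle)],
    [np.sin(angle), np.cos(angle)],
])
Y = X @ rotation

W = procrustes_map(X, Y)
```

On real data the two spaces are only approximately related by a rotation, so `X @ W` lands near, not exactly on, `Y`.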
Articles
- Comparing Sentence Similarity Methods
- The Current Best of Universal Word Embeddings and Sentence Embeddings
- On sentence representations, pt. 1: what can you fit into a single #$!%@*&% blog post?
- Deep-learning-free Text and Sentence Embedding, Part 1
- Deep-learning-free Text and Sentence Embedding, Part 2
- An Overview of Sentence Embedding Methods
- Word embeddings in 2017: Trends and future directions
- A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings
- A survey of cross-lingual word embedding models
- Introducing state of the art text classification with universal language models
- Document Embedding Techniques