  • Stars: 1,192
  • Rank: 39,244 (Top 0.8%)
  • Language: C++
  • License: Other
  • Created: over 7 years ago
  • Updated: over 2 years ago


Repository Details

General purpose unsupervised sentence representations

Updates

Code and pre-trained models related to Bi-Sent2Vec, the cross-lingual extension of Sent2Vec, can be found here.

Sent2vec

TLDR: This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.

Table of Contents

  • Setup and Requirements
  • Sentence Embeddings
  • Generating Features from Pre-Trained Models
  • Downloading Sent2vec Pre-Trained Models
  • Tokenizing
  • Train a New Sent2vec Model
  • Nearest Neighbour Search and Analogies
  • Unigram Embeddings
  • References

Setup and Requirements

Our code builds upon Facebook's FastText library; see also their documentation and Python interfaces.

To compile the library, simply run make from the project root.

A Cython module allows you to keep the model in memory while inferring sentence embeddings. In order to compile and install the module, run the following from the project root folder:

pip install .

Note: if you install sent2vec using

$ pip install sent2vec

you will get a different, unrelated package. Please follow the instructions in this README to install it correctly.

Sentence Embeddings

For the purpose of generating sentence representations, we introduce our sent2vec method and provide code and models. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences.

The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. The algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks, and on many tasks it even beats supervised models, highlighting the robustness of the produced sentence embeddings; see the paper for more details.
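
To make the compositional idea concrete, here is a minimal, illustrative sketch of how a sentence vector is composed as the average of learned unigram and word-bigram embeddings. This is not the library's training code; the toy embedding table emb and the pre-tokenized sentence are assumptions for illustration only.

import numpy as np

# Toy embedding table: in the real model these vectors are learned with a
# CBOW-like objective over whole sentences; here they are random placeholders.
dim = 4
emb = {tok: np.random.randn(dim) for tok in
       ["once", "upon", "a", "time", "once upon", "upon a", "a time"]}

def sentence_vector(tokens, emb, dim):
    # Collect unigrams plus word bigrams, then average their embeddings.
    ngrams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    vecs = [emb[g] for g in ngrams if g in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector("once upon a time".split(), emb, dim))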

Generating Features from Pre-Trained Models

Directly from Python

If you've installed the Cython module, you can infer sentence embeddings while keeping the model in memory:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .") 
embs = model.embed_sentences(["first sentence .", "another sentence"])

Text preprocessing (tokenization and lowercasing) is not handled by the module; check wikiTokenize.py for tokenization using NLTK and Stanford NLP.

An alternative to the Cython module is the Python code provided in the get_sentence_embeddings_from_pre-trained_models notebook. It handles tokenization and can be given raw sentences, but it does not keep the model in memory.
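
If you preprocess text yourself before calling the Cython module, a minimal sketch could look like the following. This is a simplified stand-in for the provided tokenization scripts (which use NLTK and Stanford tools); it assumes NLTK and its 'punkt' tokenizer data are installed and that the model expects lowercased, space-separated tokens.

import sent2vec
from nltk.tokenize import word_tokenize  # requires nltk and its 'punkt' data

def preprocess(raw_sentence):
    # Lowercase and tokenize, then re-join with spaces as the model expects.
    return " ".join(word_tokenize(raw_sentence.lower()))

model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence(preprocess("Once upon a time, there was a model."))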

Running Inference with Multiple Processes

The Cython module has an 'inference' mode for loading the model: it loads the model's input matrix into a shared memory segment and skips the output matrix, which is not needed for inference. This is an optimization for the use case of running inference with multiple independent processes, which would otherwise each need to load a copy of the model into their own address space. To use it:

model.load_model('model.bin', inference_mode=True)

The model is loaded into a shared memory segment named after the model file. The model stays in memory until you explicitly remove the shared memory segment. To do so from Python:

model.release_shared_mem('model.bin')
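
For example, a minimal multi-process sketch might look like the following. The model path, the worker count, and the example sentences are assumptions; only the load_model, embed_sentence, and release_shared_mem calls shown above are used.

import sent2vec
from multiprocessing import Pool

MODEL_PATH = 'model.bin'

def embed(sentence):
    # Each worker attaches to the shared input matrix instead of loading
    # its own private copy of the model.
    worker_model = sent2vec.Sent2vecModel()
    worker_model.load_model(MODEL_PATH, inference_mode=True)
    return worker_model.embed_sentence(sentence)

if __name__ == '__main__':
    # The parent process creates the shared memory segment once.
    model = sent2vec.Sent2vecModel()
    model.load_model(MODEL_PATH, inference_mode=True)

    with Pool(processes=4) as pool:
        embeddings = pool.map(embed, ["first sentence .", "another sentence ."])

    # Remove the shared memory segment once no process needs it any more.
    model.release_shared_mem(MODEL_PATH)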

Using the Command-line Interface

Given a pre-trained model model.bin (see download links below), here is how to generate sentence features for an input text. Use the print-sentence-vectors command; the input text file must contain one sentence per line:

./fasttext print-sentence-vectors model.bin < text.txt

This will output the sentence vectors (the features for each input sentence) to standard output, one vector per line. It can also be used with pipes:

cat text.txt | ./fasttext print-sentence-vectors model.bin
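
Since the output is one whitespace-separated vector per line, it can be loaded back with numpy, for example. A small sketch, assuming the vectors were redirected to a hypothetical file vectors.txt:

import numpy as np

# Each line of vectors.txt holds the whitespace-separated components of one
# sentence vector, in the same order as the sentences in text.txt.
vectors = np.loadtxt('vectors.txt')
print(vectors.shape)  # (num_sentences, embedding_dim)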

Downloading Sent2vec Pre-Trained Models

(as used in the NAACL2018 paper)

Note: users who downloaded models prior to this release will encounter compatibility issues when trying to use the old models with the latest commit. Those users can still use the code in the release to keep using old models.

Tokenizing

Both feature generation (as above) and training (as below) require the input texts (sentences) to be already tokenized. To tokenize and preprocess text for the above models, you can use

python3 tweetTokenize.py <tweets_folder> <dest_folder> <num_process>

for tweets, or the following for Wikipedia:

python3 wikiTokenize.py corpora > destinationFile

Note: for wikiTokenize.py, set the SNLP_TAGGER_JAR parameter to the path of stanford-postagger.jar, which you can download here.

Train a New Sent2vec Model

To train a new sent2vec model, you first need a large training text file containing one sentence per line. The provided code does not perform tokenization and lowercasing; you have to preprocess your input data yourself (see above).

You can then train a new model. Here is an example command:

./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000 -maxVocabSize 750000 -numCheckPoints 10

Here is a description of all available arguments:

sent2vec -input train.txt -output model

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -lr                 learning rate [0.2]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                dimension of word and sentence vectors [100]
  -epoch              number of epochs [5]
  -minCount           minimal number of word occurrences [5]
  -minCountLabel      minimal number of label occurrences [0]
  -neg                number of negatives sampled [10]
  -wordNgrams         max length of word ngram [2]
  -loss               loss function {ns, hs, softmax} [ns]
  -bucket             number of hash buckets for vocabulary [2000000]
  -thread             number of threads [2]
  -t                  sampling threshold [0.0001]
  -dropoutK           number of ngrams dropped when training a sent2vec model [2]
  -verbose            verbosity level [2]
  -maxVocabSize       vocabulary exceeding this size will be truncated [None]
  -numCheckPoints     number of intermediary checkpoints to save when training [1]
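
Once training finishes, the resulting model can be used like any pre-trained model from Python. A short sketch, assuming the training command above wrote my_model.bin and that the training data was tokenized and lowercased:

import sent2vec

# Load the freshly trained model and embed a preprocessed sentence.
model = sent2vec.Sent2vecModel()
model.load_model('my_model.bin')
emb = model.embed_sentence("a tokenized , lowercased sentence .")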

Nearest Neighbour Search and Analogies

Given a pre-trained model model.bin, here is how to use these features. For the nearest-neighbour sentence feature, you need the model as well as a corpus in which to search for the nearest neighbouring sentence to your input sentence. We use cosine distance as our distance metric. To do so, use the nnSent command; the input should be one sentence per line:

./fasttext nnSent model.bin corpora [k] 

k is optional and is the number of nearest sentences that you want to output.

For analogiesSent, the user inputs three sentences A, B, and C, and the command finds the sentence from the corpus that is closest to D in the A:B::C:D analogy pattern.

./fasttext analogiesSent model.bin corpora [k]

k is optional and is the number of nearest sentences that you want to output.
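
If you prefer to stay in Python, a nearest-neighbour lookup over sentence embeddings can also be sketched with the Cython module and cosine similarity. The toy corpus, the query sentence, and the value of k below are assumptions for illustration; model.bin is a pre-trained model as above.

import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('model.bin')

corpus = ["first sentence .", "another sentence .", "a third sentence ."]
corpus_emb = model.embed_sentences(corpus)
query_emb = model.embed_sentences(["yet another sentence ."])[0]

# Cosine similarity between the query and every corpus sentence.
norms = np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
sims = corpus_emb @ query_emb / np.maximum(norms, 1e-9)
k = 2
for idx in np.argsort(-sims)[:k]:
    print(sims[idx], corpus[idx])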

Unigram Embeddings

For the purpose of generating word representations, we compared word embeddings obtained by training sent2vec models with other word embedding models, including a novel method we refer to as CBOW char + word ngrams (cbow-c+w-ngrams). This method augments the fastText character-augmented CBOW with word n-grams. You can see the full comparison of results in this paper.

Extracting Word Embeddings from Pre-Trained Models

If you have the Cython wrapper installed, the following functions let you work with word embeddings obtained from sent2vec or cbow-c+w-ngrams models:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin') # The model can be sent2vec or cbow-c+w-ngrams
vocab = model.get_vocabulary() # Return a dictionary with words and their frequency in the corpus
uni_embs, vocab = model.get_unigram_embeddings() # Return the full unigram embedding matrix
uni_embs = model.embed_unigrams(['dog', 'cat']) # Return unigram embeddings given a list of unigrams

Asking for a unigram embedding not present in the vocabulary returns a zero vector in the case of sent2vec. The cbow-c+w-ngrams method can use subword character n-grams to infer some representation.
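
As a quick usage sketch, the word vectors returned by embed_unigrams can be compared with cosine similarity; the word pair below is an arbitrary example and model.bin is assumed to be a sent2vec or cbow-c+w-ngrams model.

import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('model.bin')

# embed_unigrams returns one embedding row per requested word.
vecs = model.embed_unigrams(['dog', 'cat'])
dog, cat = vecs[0], vecs[1]
sim = dog @ cat / max(np.linalg.norm(dog) * np.linalg.norm(cat), 1e-9)
print("cosine(dog, cat) =", sim)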

Downloading Pre-Trained Models

Coming soon.

Train a CBOW Character and Word Ngrams Model

Training is very similar to the sent2vec instructions. A plausible command would be:

./fasttext cbow-c+w-ngrams -input wiki_sentences.txt -output my_model -lr 0.05 -dim 300 -ws 10 -epoch 9 -maxVocabSize 750000 -thread 20 -numCheckPoints 20 -t 0.0001 -neg 5 -bucket 4000000 -bucketChar 2000000 -wordNgrams 3 -minn 3 -maxn 6

References

When using this code or some of our pre-trained models for your application, please cite the following paper for sentence embeddings:

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi, Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features NAACL 2018

@inproceedings{pgj2017unsup,
  title = {{Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features}},
  author = {Pagliardini, Matteo and Gupta, Prakhar and Jaggi, Martin},
  booktitle={NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2018}
}

For word embeddings:

Prakhar Gupta, Matteo Pagliardini, Martin Jaggi, Better Word Embeddings by Disentangling Contextual n-Gram Information NAACL 2019

@inproceedings{DBLP:conf/naacl/GuptaPJ19,
  author    = {Prakhar Gupta and
               Matteo Pagliardini and
               Martin Jaggi},
  title     = {Better Word Embeddings by Disentangling Contextual n-Gram Information},
  booktitle = {{NAACL-HLT} {(1)}},
  pages     = {933--939},
  publisher = {Association for Computational Linguistics},
  year      = {2019}
}

More Repositories

1. ML_course (Jupyter Notebook, 1,254 stars): EPFL Machine Learning Course, Fall 2024
2. OptML_course (Jupyter Notebook, 1,122 stars): EPFL Course - Optimization for Machine Learning - CS-439
3. attention-cnn (Python, 1,073 stars): Source code for "On the Relationship between Self-Attention and Convolutional Layers"
4. landmark-attention (Python, 258 stars): Landmark Attention: Random-Access Infinite Context Length for Transformers
5. federated-learning-public-code (Python, 157 stars)
6. disco (TypeScript, 152 stars): DISCO is a code-free and installation-free browser platform that allows any non-technical user to collaboratively train machine learning models without sharing any private data.
7. collaborative-attention (Python, 148 stars): Code for Multi-Head Attention: Collaborate Instead of Concatenate
8. powersgd (Python, 137 stars): Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727
9. dynamic-sparse-flash-attention (Jupyter Notebook, 129 stars)
10. DenseFormer (Python, 74 stars)
11. llm-baselines (Python, 68 stars)
12. ChocoSGD (Python, 59 stars): Decentralized SGD and Consensus with Communication Compression: https://arxiv.org/abs/1907.09356
13. sparsifiedSGD (Jupyter Notebook, 54 stars): Sparsified SGD with Memory: https://arxiv.org/abs/1809.07599
14. optML-pku (42 stars): summer school materials
15. LocalSGD-Code (Python, 42 stars)
16. error-feedback-SGD (Jupyter Notebook, 28 stars): SGD with compressed gradients and error-feedback: https://arxiv.org/abs/1901.09847
17. interpret-lm-knowledge (Jupyter Notebook, 22 stars): Extracting knowledge graphs from language models as a diagnostic benchmark of model performance (NeurIPS XAI 2021).
18. byzantine-robust-optimizer (Jupyter Notebook, 21 stars): Learning from history for Byzantine Robustness
19. Bi-Sent2Vec (C++, 20 stars): Robust Cross-lingual Embeddings from Parallel Sentences
20. opt-summerschool (Jupyter Notebook, 20 stars): Short Course on Optimization for Machine Learning - Slides and Practical Labs - DS3 Data Science Summer School, June 24 to 28, 2019, Paris, France
21. cola (Python, 18 stars): CoLa - Decentralized Linear Learning: https://arxiv.org/abs/1808.04883
22. opt-shortcourse (Jupyter Notebook, 18 stars): Short Course on Optimization for Machine Learning - Slides and Practical Lab - Pre-doc Summer School on Learning Systems, July 3 to 7, 2017, Zürich, Switzerland
23. powergossip (Python, 15 stars): Code for "Practical Low-Rank Communication Compression in Decentralized Deep Learning"
24. byzantine-robust-noniid-optimizer (Python, 15 stars)
25. X2Static (Python, 12 stars): X2Static embeddings
26. kubernetes-setup (Dockerfile, 12 stars): MLO group setup for kubernetes cluster
27. topology-in-decentralized-learning (Python, 10 stars): Code related to "Beyond spectral gap: The role of the topology in decentralized learning".
28. quasi-global-momentum (Python, 10 stars)
29. relaysgd (Jupyter Notebook, 10 stars): Code for the paper "RelaySum for Decentralized Deep Learning on Heterogeneous Data"
30. piecewise-affine-multiplication (Python, 7 stars)
31. rotational-optimizers (Python, 6 stars)
32. byzantine-robust-decentralized-optimizer (Jupyter Notebook, 6 stars)
33. uncertainity-estimation (Jupyter Notebook, 6 stars): Code for the paper "The Peril of Popular Deep Learning Uncertainty Estimation Methods"
34. getting-started (Python, 6 stars)
35. text_to_image_generation (Python, 5 stars)
36. easy-summary (Python, 5 stars): difficulty-guided text summarization
37. FeAI (Vue, 4 stars): Federated Learning with TensorFlow.js
38. ghost-noise (Python, 3 stars)
39. autoTrain (Python, 3 stars): Open Challenge - Automatic Training for Deep Learning
40. pax (Python, 3 stars): JAX-like API for PyTorch
41. personalized-collaborative-llms (Python, 2 stars)
42. phantomedicus (Jupyter Notebook, 1 star): MedSurge: medical survey generator