[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

SimCSE: Simple Contrastive Learning of Sentence Embeddings

This repository contains the code and pre-trained models for our paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.

Overview

We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using entailment pairs as positives and contradiction pairs as hard negatives. The following figure is an illustration of our models.
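The objective described above can be sketched in plain Python. This is a minimal illustration, not the repository's training code: the function names, the toy 2-d vectors, and the temperature value are assumptions chosen for the example. Each anchor is scored against its own positive and against every other positive in the batch (plus all hard negatives, in the supervised case). In the unsupervised setting the "positive" would be the same sentence encoded a second time with a different dropout mask.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, positives, hard_negatives=None, temp=0.05):
    """Mean cross-entropy over in-batch candidates (hypothetical helper).

    anchors[i] and positives[i] form a positive pair. Every other
    positives[j], and every hard negative if given, acts as a negative.
    """
    candidates = list(positives)
    if hard_negatives is not None:
        candidates += list(hard_negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temp for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (made up for illustration).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_unsup = contrastive_loss(anchors, positives)
loss_sup = contrastive_loss(anchors, positives, hard_negatives)
```

Adding hard negatives can only enlarge the softmax denominator, so the loss with hard negatives is never smaller than the loss without them for the same embeddings.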

Getting Started

We provide an easy-to-use sentence embedding tool based on our SimCSE model (see our Wiki for detailed usage). To use the tool, first install the simcse package from PyPI

pip install simcse

Or directly install it from our code

python setup.py install

Note that to enable GPU encoding, you need a PyTorch build that supports CUDA. See the official PyTorch website for instructions.

After installing the package, you can load our model with just two lines of code:

from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

See model list for a full list of available models.

Then you can use our model for encoding sentences into embeddings

embeddings = model.encode("A woman is reading.")

Compute the cosine similarities between two groups of sentences

sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
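To make the pairwise computation concrete, here is a small pure-Python sketch. It assumes the result is a matrix with one row per sentence in the first group and one column per sentence in the second; the helper names and the toy vectors are hypothetical stand-ins for real embeddings, and this is not the package's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_similarity(embs_a, embs_b):
    """Return a len(embs_a) x len(embs_b) matrix of cosine similarities."""
    return [[cosine(a, b) for b in embs_b] for a in embs_a]

# Made-up 3-d vectors standing in for sentence embeddings.
embs_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embs_b = [[0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]

sim = pairwise_similarity(embs_a, embs_b)
# sim[i][j] compares the i-th vector of group a with the j-th vector of group b.
```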

Or build index for a group of sentences and search among them

sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
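Conceptually, index-and-search amounts to storing normalized embeddings and ranking them by similarity to the query. The sketch below is a brute-force illustration under stated assumptions: `ToyIndex`, its `threshold` and `top_k` parameters, and the 2-d vectors are all hypothetical and are not taken from the simcse package.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class ToyIndex:
    """Brute-force index over precomputed embeddings (illustration only)."""

    def __init__(self, sentences, embeddings):
        self.sentences = list(sentences)
        self.vectors = [normalize(v) for v in embeddings]

    def search(self, query_vec, threshold=0.6, top_k=5):
        """Return up to top_k (sentence, score) pairs with score >= threshold."""
        query = normalize(query_vec)
        scored = [
            (sentence, sum(a * b for a, b in zip(query, vec)))
            for sentence, vec in zip(self.sentences, self.vectors)
        ]
        hits = [pair for pair in scored if pair[1] >= threshold]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:top_k]

# Made-up embeddings for the two indexed sentences and for the query.
index = ToyIndex(
    ["A woman is reading.", "A man is playing a guitar."],
    [[1.0, 0.1], [0.1, 1.0]],
)
results = index.search([0.2, 0.9])
```

A library such as faiss replaces the linear scan with an optimized index, which matters once the collection grows beyond a few thousand sentences.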

We also support faiss, an efficient similarity search library. Install the package by following the faiss installation instructions, and simcse will automatically use faiss for efficient search.

WARNING: We have found that faiss does not support Nvidia Ampere GPUs (3090 and A100) well. In that case, switch to other GPUs or install the CPU version of the faiss package.

We also provide an easy-to-build demo website to show how SimCSE can be used in sentence retrieval. The code is based on DensePhrases' repo and demo (many thanks to the authors of DensePhrases).

Model List

Our released models are listed below. You can import these models by using the simcse package or HuggingFace's Transformers.

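+-----------------------------------------------+----------+
| Model                                         | Avg. STS |
+-----------------------------------------------+----------+
| princeton-nlp/unsup-simcse-bert-base-uncased  |    76.25 |
| princeton-nlp/unsup-simcse-bert-large-uncased |    78.41 |
| princeton-nlp/unsup-simcse-roberta-base       |    76.57 |
| princeton-nlp/unsup-simcse-roberta-large      |    78.90 |
| princeton-nlp/sup-simcse-bert-base-uncased    |    81.57 |
| princeton-nlp/sup-simcse-bert-large-uncased   |    82.21 |
| princeton-nlp/sup-simcse-roberta-base         |    82.52 |
| princeton-nlp/sup-simcse-roberta-large        |    83.76 |
+-----------------------------------------------+----------+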

Note that these results are slightly better than those reported in the current version of the paper, because we adopted a new set of hyperparameters (for the hyperparameters, see the training section).

Naming rules: unsup and sup represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.

Use SimCSE with HuggingFace

Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's transformers:

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

If you encounter any problems when loading the models directly through HuggingFace's API, you can also download the models manually from the above table and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOAD MODEL}).

Train SimCSE

In the following section, we describe how to train a SimCSE model by using our code.

Requirements

First, install PyTorch by following the instructions on the official website. To faithfully reproduce our results, please use version 1.7.1 built for your platform and CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA 11, install PyTorch with the following command:

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

If you instead use CUDA <11 or CPU, install PyTorch by the following command,

pip install torch==1.7.1

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation uses the "all" setting and reports Spearman's correlation. See our paper (Appendix B) for evaluation details.
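Spearman's correlation is the Pearson correlation computed on the ranks of the two score lists. The following is a minimal pure-Python sketch with average ranks for ties; it is not the SentEval implementation, and the gold and predicted values are made up for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human similarity ratings and model cosine similarities.
gold = [5.0, 3.2, 0.4, 4.1]
predicted = [0.91, 0.55, 0.60, 0.80]

rho = spearman(gold, predicted)
print(f"{rho:.2f}")  # prints 0.80
```

Only the ordering of the predicted similarities matters, so any monotonic rescaling of the cosine scores leaves the metric unchanged.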

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Then return to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example,

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

which is expected to output the results in a tabular format:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
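The Avg. column in this output appears to be the plain mean of the seven task scores, which a quick check confirms:

```python
# Scores copied from the test output shown in the table.
scores = {
    "STS12": 75.30,
    "STS13": 84.67,
    "STS14": 80.19,
    "STS15": 85.40,
    "STS16": 80.82,
    "STSBenchmark": 84.26,
    "SICKRelatedness": 80.39,
}

avg = sum(scores.values()) / len(scores)
print(f"{avg:.2f}")  # prints 81.58
```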

Arguments for the evaluation script are as follows,

  • --model_name_or_path: The name or path of a transformers-based pre-trained checkpoint. You can directly use the models in the above table, e.g., princeton-nlp/sup-simcse-bert-base-uncased.
  • --pooler: Pooling method. Currently supported:
    • cls (default): Use the representation of [CLS] token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation). If you use supervised SimCSE, you should use this option.
    • cls_before_pooler: Use the representation of [CLS] token without the extra linear+activation. If you use unsupervised SimCSE, you should take this option.
    • avg: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa (paper), you should use this option.
    • avg_top2: Average embeddings of the last two layers.
    • avg_first_last: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
  • --mode: Evaluation mode
    • test (default): The default test mode. To faithfully reproduce our results, you should use this option.
    • dev: Report the development set results. Note that in STS tasks, only STS-B and SICK-R have development sets, so we only report their numbers. It also takes a fast mode for transfer tasks, so the running time is much shorter than the test mode (though numbers are slightly lower).
    • fasttest: Same as test, but in a fast mode with a much shorter running time; the reported numbers may be lower (transfer tasks only).
  • --task_set: What set of tasks to evaluate on (if set, it will override --tasks)
    • sts (default): Evaluate on STS tasks, including STS 12~16, STS-B and SICK-R. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
    • transfer: Evaluate on transfer tasks.
    • full: Evaluate on both STS and transfer tasks.
    • na: Manually set tasks by --tasks.
  • --tasks: Specify which dataset(s) to evaluate on. Will be overridden if --task_set is not na. See the code for a full list of tasks.
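The pooling options listed above differ only in which hidden states are combined. The sketch below illustrates four of them on toy nested lists for a single sentence; it is not the repository's code. It assumes that "first" means the first entry of the hidden-state list, and it leaves out the cls option because that one additionally applies a learned linear+activation layer. Padding positions are excluded through the attention mask.

```python
def masked_mean(token_vectors, mask):
    """Average token vectors, skipping positions where mask is 0."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [t / count for t in total]

def layer_average(layer_a, layer_b):
    """Element-wise average of two layers with the same shape."""
    return [
        [(a + b) / 2 for a, b in zip(tok_a, tok_b)]
        for tok_a, tok_b in zip(layer_a, layer_b)
    ]

def pool(hidden_states, mask, pooler):
    """hidden_states: list of layers, each a list of per-token vectors."""
    last = hidden_states[-1]
    if pooler == "cls_before_pooler":
        return list(last[0])  # the [CLS] token is the first position
    if pooler == "avg":
        return masked_mean(last, mask)
    if pooler == "avg_top2":
        return masked_mean(layer_average(hidden_states[-2], last), mask)
    if pooler == "avg_first_last":
        # Assumption: "first" is hidden_states[0] in this toy list.
        return masked_mean(layer_average(hidden_states[0], last), mask)
    raise ValueError(f"unsupported pooler: {pooler}")

# Three toy layers, three tokens (the last one is padding), two dimensions.
hidden_states = [
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    [[0.0, 2.0], [2.0, 4.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
]
mask = [1, 1, 0]

pooled = {name: pool(hidden_states, mask, name)
          for name in ("cls_before_pooler", "avg", "avg_top2", "avg_first_last")}
```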

Training

Data

For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run data/download_wiki.sh and data/download_nli.sh to download the two datasets.

Training scripts

We provide example training scripts for both unsupervised and supervised SimCSE. In run_unsup_example.sh, we provide a single-GPU (or CPU) example for the unsupervised version, and in run_sup_example.sh we give a multiple-GPU example for the supervised version. Both scripts call train.py for training. The arguments are explained below:

  • --train_file: Training file path. We support "txt" files (one line for one sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data with the same format.
  • --model_name_or_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.) and RoBERTa-based models (RoBERTa-base, RoBERTa-large, etc.).
  • --temp: Temperature for the contrastive loss.
  • --pooler_type: Pooling method. It takes the same options as --pooler in the evaluation section.
  • --mlp_only_train: We have found that for unsupervised SimCSE, it works better to train the model with MLP layer but test the model without it. You should use this argument when training unsupervised SimCSE models.
  • --hard_negative_weight: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value).
  • --do_mlm: Whether to use the MLM auxiliary objective. If True:
    • --mlm_weight: Weight for the MLM objective.
    • --mlm_probability: Masking rate for the MLM objective.
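As an illustration of the 3-column "csv" format and of the logarithmic --hard_negative_weight convention, here is a small sketch. The header names, the presence of a header row, the example sentences, and the helper function are assumptions made for this example rather than requirements confirmed by the training script.

```python
import csv
import io
import math

# One training example: sentence, positive, hard negative (made-up data).
rows = [
    ("There's a kid on a skateboard.",
     "A kid is skateboarding.",
     "A kid is inside the house."),
]

# Write the file contents to an in-memory buffer instead of disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sent0", "sent1", "hard_neg"])  # hypothetical column names
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back and decide from the column count whether hard negatives exist.
parsed = list(csv.reader(io.StringIO(csv_text)))
num_columns = len(parsed[0])
uses_hard_negatives = num_columns == 3

def hard_negative_weight_arg(weight):
    """Convert a desired weight into the value passed on the command line."""
    return math.log(weight)

arg_for_weight_one = hard_negative_weight_arg(1.0)  # 0.0, the default
```

The csv module quotes fields that contain commas, so sentences with punctuation survive a round trip through the file.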

All other arguments are standard HuggingFace transformers training arguments. Some frequently used ones are --output_dir, --learning_rate, and --per_device_train_batch_size. Our example scripts also evaluate the model on the STS-B development set (download the dataset first, following the evaluation section) and save the best checkpoint.

For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.

Hyperparameters

We use the following hyperparameters for training SimCSE:

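+-----------------------+-------------+----------------+------+
|                       | Unsup. BERT | Unsup. RoBERTa | Sup. |
+-----------------------+-------------+----------------+------+
| Batch size            |          64 |            512 |  512 |
| Learning rate (base)  |        3e-5 |           1e-5 | 5e-5 |
| Learning rate (large) |        1e-5 |           3e-5 | 1e-5 |
+-----------------------+-------------+----------------+------+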

Convert models

Our saved checkpoints are slightly different from HuggingFace's pre-trained checkpoints. Run python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER} to convert a checkpoint. After that, you can evaluate it with our evaluation code or use it directly out of the box.

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu ([email protected]) and Xingcheng ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!

Citation

Please cite our paper if you use SimCSE in your work:

@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}

SimCSE Elsewhere

We thank the community for its efforts to extend SimCSE!
