• Stars
    star
    111
  • Rank 314,510 (Top 7 %)
  • Language
    Python
  • License
    MIT License
  • Created over 2 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration πŸšƒ


This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers more flexibility when using our training scripts, while also making it easier to adapt our code contributions into other projects.

Why DinkyTrain?

The Dinky runs between Princeton Junction and Princeton and is the shortest scheduled commuter rail line in the United States. We also aim to make pre-training short and accessible to everyone.

Our Contributions

  • DeepSpeed transformer kernel integration
  • A training recipe for efficient MLM pre-training
  • An easy-to-follow guideline of using fairseq for MLM pre-training.

Other fairseq features:

See the fairseq repo and its documentation for more details on how to use and extend fairseq.

DinkyTrain for Efficient MLM Pre-training

Quick Links

Overview

You can reproduce the pre-training experiments of our recent paper Should You Mask 15% in Masked Language Modeling?, where we find that higher masking rates can lead to more efficient pre-training.

Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • To install fairseq and develop locally:
git clone https://github.com/princeton-nlp/DinkyTrain.git
cd DinkyTrain
pip install --editable ./
  • For faster training (FP16) install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For faster training (DeepSpeed cuda kernel) install DeepSpeed library and compile the DeepSpeed kernel
DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install deepspeed
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run .

Trouble-shooting:

  • If using lower version of Python, you might encounter import problems with importlib.metadata. Try pip install importlib-metadata.
  • To install apex and deepspeed, you will need nvcc (CUDA compiler).
  • When installing apex, if you encounter the error Cuda extensions are bing compiled with a version of Cuda that does not match ..., go to setup.py and comment out the line that raised the error (at your own risk).
  • Both apex and deepspeed installation require a high gcc version to support c++14. If you encounter relevant errors, update your gcc.

Data Pre-processing

Tokenization: First, download the GPT2 BPE vocabulary:

wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

Then, tokenize your raw data:

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs ${SPLIT}.raw \
    --outputs ${SPLIT}.bpe \
    --keep-empty \
    --workers 8

Finally, index and binarize your data:

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref ${TRAIN_SPLIT}.bpe \
    --validpref ${VALID_SPLIT}.bpe \
    --testpref ${TEST_SPLIT}.bpe \
    --destdir output-bin \
    --workers 8

Alternatively: Use our pre-processed data: We preprocessed Wikipedia+BookCorpus and shared it on Huggingface dataset. It is ~22GB and contains two epochs of data, each epoch being sliced into 8 shards. You can download it using git:

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/wikibook_fairseq_format

Pre-training

Use our script for efficient pre-training

GPU={number of GPUs} DATA_DIR={data path} [DEEPSPEED=1] bash run_efficient_mlm_recipe.sh

Flags explained

  • GPU: number of GPUs.
  • DATA_DIR: directory to the processed pre-training data. If you are using our preprocessed dataset, DATA_DIR should be:
DATA_DIR=$(seq 0 15 | sed -e 's/^/wikibook_fairseq_format\/bin-shard/' | sed -e 's/$/-8/' | paste -sd ':')
  • DEEPSPEED (optional): if set to 1, the DeepSpeed CUDA kernel will be used.

Please refer to the script for more hyperparameter choices.

Fine-tuning on GLUE and SQuAD

All our checkpoints can be converted to HuggingFace transformers models (see next section) and use the transformers package for fine-tuning. Fairseq also supports fine-tuning on GLUE.

First, download the preprocessed GLUE data (you can also process by yourself following the preprocess section above):

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/glue_fairseq_format

Then use the following script for fine-tuning

DATA_DIR={path to the data directory} \
TASK={glue task name (mnli qnli qqp rte sst2 mrpc cola stsb)} \
LR={learning rate} \
BSZ={batch size} \
EPOCHS={number of epochs} \
SEED={random seed} \
CKPT_DIR={checkpoint's directory} \
CKPT_NAME={checkpoint's name} \
[DEEPSPEED=1] bash finetune_glue.sh

For fine-tuning on SQuAD, please convert the models to HuggingFace checkpoints following the next section and use HuggingFace's examples.

Convert to HuggingFace

We also provide conversion codes so that you can easily turn Fairseq checkpoints into HuggingFace checkpoints. Usage:

cd scripts
[PRELAYERNORM=1] [FROM_DS=1] python convert_fs_ckpt_to_hf_ckpt.py --fr {fairseq checkpoint} --to {huggingface checkpoint path} --hf_model_config {roberta-base/roberta-large}

Flags explained:

  • PRELAYERNORM=1: Using pre layer-norm (default is post layer-norm).
  • FROM_DS=1: The Fairseq checkpoint uses DeepSpeed's cuda kernel.
  • --fr: The path to the Fairseq checkpoint.
  • --to: The path you want to save the HuggingFace checkpoint to.
  • --hf_model_config: roberta-base or roberta-large.

IMPORTANT: all our models use pre layer norm, which is not supported by HuggingFace yet. To use it, import the model class from huggingface/modeling_roberta_prelayernorm.py. For example:

from huggingface.modeling_roberta_prelayernorm import RobertaForSequenceClassification

For more configuration, please refer to convert_fs_ckpt_to_hf_ckpt.py.

Model List

Here are the HuggingFace checkpoints of our models in the paper Should You Mask 15% in Masked Language Modeling. Results are development set performance.

Model MNLI QNLI QQP SST-2
princeton-nlp/efficient_mlm_m0.15 84.2 90.9 87.8 93.3
princeton-nlp/efficient_mlm_m0.20 84.1 91.3 87.9 92.7
princeton-nlp/efficient_mlm_m0.30 84.2 91.6 88.0 93.0
princeton-nlp/efficient_mlm_m0.40 84.5 91.6 88.1 92.8
princeton-nlp/efficient_mlm_m0.50 84.1 91.1 88.1 92.7
princeton-nlp/efficient_mlm_m0.60 83.2 90.7 87.8 92.6
princeton-nlp/efficient_mlm_m0.70 82.3 89.4 87.5 91.9
princeton-nlp/efficient_mlm_m0.80 80.8 87.9 87.1 90.5
princeton-nlp/efficient_mlm_m0.15-801010 83.7 90.4 87.8 93.2
princeton-nlp/efficient_mlm_m0.40-801010 84.3 91.2 87.9 93.0

We also offer the original (deepspeed) fairseq checkpoints here.

Bugs or Questions?

If you have any questions, or encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

@article{wettig2022should,
   title={Should You Mask 15% in Masked Language Modeling?},
   author={Wettig, Alexander and Gao, Tianyu and Zhong, Zexuan and Chen, Danqi},
   book={arXiv preprint arXiv:2202.08005},
   year={2022}
}

Acknowledgment

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

  • Our efficient training recipe is based on the following paper:

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Empirical Methods in Natural Language Processing (EMNLP), pages 10644–10652.

More Repositories

1

SWE-agent

SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges. [NeurIPS 2024]
Python
13,504
star
2

tree-of-thought-llm

[NeurIPS 2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Python
4,726
star
3

SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
Python
3,381
star
4

SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
Python
1,846
star
5

MeZO

[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
Python
1,031
star
6

PURE

[NAACL 2021] A Frustratingly Easy Approach for Entity and Relation Extraction https://arxiv.org/abs/2010.12812
Python
788
star
7

LM-BFF

[ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723
Python
714
star
8

SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward
Python
672
star
9

DensePhrases

[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval Learns Passage Retrieval, Too https://arxiv.org/abs/2012.12624
Python
605
star
10

LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Python
546
star
11

ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
Python
450
star
12

LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
Jupyter Notebook
354
star
13

AutoCompressors

[EMNLP 2023] Adapting Language Models to Compress Long Contexts
Python
273
star
14

WebShop

[NeurIPS 2022] πŸ›’WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
Python
264
star
15

TRIME

[EMNLP 2022] Training Language Models with Memory Augmentation https://arxiv.org/abs/2205.12674
Python
192
star
16

intercode

[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
Python
191
star
17

CoFiPruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
Python
188
star
18

OptiPrompt

[NAACL 2021] Factual Probing Is [MASK]: Learning vs. Learning to Recall https://arxiv.org/abs/2104.05240
Python
167
star
19

TransformerPrograms

[NeurIPS 2023] Learning Transformer Programs
Python
157
star
20

EntityQuestions

EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers https://arxiv.org/abs/2109.08535
Python
139
star
21

QuRating

[ICML 2024] Selecting High-Quality Data for Training Language Models
Python
137
star
22

CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
Python
135
star
23

LLMBar

[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
Python
108
star
24

MQuAKE

[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
Jupyter Notebook
97
star
25

USACO

Can Language Models Solve Olympiad Programming?
Python
96
star
26

ProLong

Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
Python
82
star
27

NLProofS

EMNLP 2022: Generating Natural Language Proofs with Verifier-Guided Search https://arxiv.org/abs/2205.12443
Python
81
star
28

CharXiv

[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Python
72
star
29

MADE

EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering
Python
70
star
30

LM-Kernel-FT

A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
Python
68
star
31

c-sts

[EMNLP 2023] C-STS: Conditional Semantic Textual Similarity
Python
66
star
32

calm-textgame

[EMNLP 2020] Keep CALM and Explore: Language Models for Action Generation in Text-based Games
Python
65
star
33

DataMUX

[NeurIPS 2022] DataMUX: Data Multiplexing for Neural Networks
Jupyter Notebook
58
star
34

ShortcutGrammar

EMNLP 2022: Finding Dataset Shortcuts with Grammar Induction https://arxiv.org/abs/2210.11560
Jupyter Notebook
57
star
35

LitSearch

A Retrieval Benchmark for Scientific Literature Search
Python
54
star
36

Collie

[ICLR 2024] COLLIE: Systematic Construction of Constrained Text Generation Tasks
Jupyter Notebook
52
star
37

EvalConvQA

[ACL 2022] Ditch the Gold Standard: Re-evaluating Conversational Question Answering
Python
45
star
38

HELMET

The HELMET Benchmark
Python
42
star
39

MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
Python
37
star
40

LM-Science-Tutor

Python
34
star
41

rationale-robustness

NAACL 2022: Can Rationalization Improve Robustness? https://arxiv.org/abs/2204.11790
Python
26
star
42

PTP

Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073
Python
25
star
43

corpus-poisoning

[EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156
Python
25
star
44

InstructEval

[NAACL 2024 Findings] Evaluation suite for the systematic evaluation of instruction selection methods.
Jupyter Notebook
23
star
45

Edge-Pruning

Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
Python
22
star
46

WhatICLLearns

[ACL 2023 Findings] What In-Context Learning β€œLearns” In-Context: Disentangling Task Recognition and Task Learning
Python
21
star
47

Cognac

Repo for paper: Controllable Text Generation with Language Constraints
Python
19
star
48

lwm

We develop world models that can be adapted with natural language. Intergrating these models into artificial agents allows humans to effectively control these agents through verbal communication.
Python
18
star
49

ELIZA-Transformer

Representing Rule-based Chatbots with Transformers
Python
18
star
50

semsup

Semantic Supervision: Enabling Generalization over Output Spaces
Python
16
star
51

benign-data-breaks-safety

Python
16
star
52

SRL-NLC

Safe Reinforcement Learning with Natural Language Constraints
14
star
53

datamux-pretraining

MUX-PLMs: Pretraining LMs with Data Multiplexing
Python
14
star
54

XTX

[ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration in Text Games
Python
13
star
55

MultilingualAnalysis

Repository for the paper titled: "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer"
Python
13
star
56

dyck-transformer

[ACL 2021] Self-Attention Networks Can Process Bounded Hierarchical Languages
Python
12
star
57

blindfold-textgame

[NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents
Python
12
star
58

align-mlm

Python
11
star
59

metric-wsd

NAACL'2021: Non-Parametric Few-Shot Learning for Word Sense Disambiguation
Python
10
star
60

semsup-xc

SemSup-XC: Semantic Supervision for Extreme Classification
Jupyter Notebook
10
star
61

Heuristic-Core

[ACL 2024] The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models - https://arxiv.org/abs/2403.03942
Python
9
star
62

CopyCat

Python
9
star
63

NegotiationToM

Code release for Improving Dialog Systems for Negotiation with Personality Modeling.
Python
7
star
64

CARETS

Python
6
star
65

SPARTAN

SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers
Python
5
star
66

il-scaling-in-games

Official code repo of "Scaling Laws for Imitation Learning in Single-Agent Games"
Python
5
star
67

attribute-tagging

[LaReL 2022] Towards an Enhanced, Faithful, and Adaptable Web Interaction Environment
Python
4
star
68

MoQA

Python
3
star