• Stars
    star
    180
  • Rank 206,480 (Top 5 %)
  • Language
    Python
  • License
    MIT License
  • Created about 2 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408

β˜• CoFiPruning: Structured Pruning Learns Compact and Accurate Models

This repository contains the code and pruned models for our ACL'22 paper Structured Pruning Learns Compact and Accurate Models. Our talk slides can be found here. Numerical results in the paper can be found here.

**************************** Updates ****************************

  • 05/09/2022: We release the pruned model checkpoints on RTE, MRPC and CoLA!
  • 04/01/2022: We released our paper along with pruned model checkpoints on SQuAD, SST-2, QNLI and MNLI. Check it out!

Quick Links

Overview

We propose CoFiPruning, a task-specific, structured pruning approach (Coarse and Fine-grained Pruning) and show that structured pruning can achieve highly compact subnetworks and obtain large speedups and competitive accuracy as distillation approaches, while requiring much less computation. Our key insight is to jointly prune coarse-grained units (e.g., self-attention or feed-forward layers) and fine-grained units (e.g., heads, hidden dimensions) simultaneously. Different from existing works, our approach controls the pruning decision of every single parameter by multiple masks of different granularity. This is the key to large compression, as it allows the greatest flexibility of pruned structures and eases the optimization compared to only pruning small units. We also devise a layerwise distillation strategy to transfer knowledge from unpruned to pruned models during optimization.

Main Results

We show the main results of CoFiPruning along with results of popular pruning and distillation methods including Block Pruning, DynaBERT, DistilBERT and TinyBERT. Please see more detailed results in our paper.

Model List

Our released models are listed as following. You can download these models with the following links. We use a batch size of 128 and V100 32GB GPUs for speedup evaluation. We show F1 score for SQuAD and accuracy score for GLUE datasets. s60 denotes that the sparsity of the model is roughly 60%.

model name task sparsity speedup score
princeton-nlp/CoFi-MNLI-s60 MNLI 60.2% 2.1 Γ— 85.3
princeton-nlp/CoFi-MNLI-s95 MNLI 94.3% 12.1 Γ— 80.6
princeton-nlp/CoFi-QNLI-s60 QNLI 60.3% 2.1 Γ— 91.8
princeton-nlp/CoFi-QNLI-s95 QNLI 94.5% 12.1 Γ— 86.1
princeton-nlp/CoFi-SST2-s60 SST-2 60.1% 2.1 Γ— 93.0
princeton-nlp/CoFi-SST2-s95 SST-2 94.5% 12.2 Γ— 90.4
princeton-nlp/CoFi-SQuAD-s60 SQuAD 59.8% 2.0 Γ— 89.1
princeton-nlp/CoFi-SQuAD-s93 SQuAD 92.4% 8.7 Γ— 82.6
princeton-nlp/CoFi-RTE-s60 RTE 60.2% 2.0 x 72.6
princeton-nlp/CoFi-RTE-s96 RTE 96.2% 12.8 x 66.1
princeton-nlp/CoFi-CoLA-s60 CoLA 60.4% 2.0 x 60.4
princeton-nlp/CoFi-CoLA-s95 CoLA 95.1% 12.3 x 38.9
princeton-nlp/CoFi-MRPC-s60 MRPC 61.5% 2.0 x 86.8
princeton-nlp/CoFi-MRPC-s95 MRPC 94.9% 12.2 x 83.6

You can use these models with the huggingface interface:

from models.modeling_bert import CoFiBertForSequenceClassification
model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95") 
output = model(**inputs)

Train CoFiPruning

In the following section, we provide instructions on training CoFi with our code.

Requirements

Try runing the following script to install the dependencies.

Please define a lower version of transformers, because the latest version seems seems do not have hf_bucket_url in transformers.file_utils

pip install -r requirements.txt

Training

Training scripts

We provide example training scripts for training with CoFiPruning with different combination of training units and objectives in scripts/run_CoFi.sh. The script only supports single-GPU training and we explain the arguments in following:

  • --task_name: we support sequence classification tasks and extractive question answer tasks. You can input a glue task name, e.g., MNLI or use --train_file and --validation_file arguments with other tasks (supported by HuggingFace).
  • --ex_name_suffix: experiment name (for output dir)
  • --ex_cate: experiment category name (for output dir)
  • --pruning_type: we support all combinations of the following four types of pruning units. Default pruning type is structured_heads+structured_mlp+hidden+layer. Setting it to None falls back to standard fine-tuning.
    • structured_heads: head pruning
    • structured_mlp: mlp intermediate dimension pruning
    • hidden: hidden states pruning
    • layer: layer pruning
  • --target_sparsity: target sparsity of the pruned model
  • --distillation_path: the directory of the teacher model
  • --distillation_layer_loss_alpha: weight for layer distillation
  • --distillation_ce_loss_alpha: weight for cross entropy distillation
  • --layer_distill_version: we recommend using version 4 for small-sized datasets to impose an explicit restriction on layer orders but for relatively larger datasets, version 3 and version 4 do not make much difference. @zhangzhenyu13 found that randomly selecting teacher layers leads to more stable results, which is version 6. Please find this pull request for more details.
  • --sparsity_epsilon: the epsilon to relax the sparsity constraint. If set to be larger than 0, the training process will start saving models with a sparsity target_sparsity - sparsity_epislon. This is recommended to be set to be 0.01 when training with 0.95 sparsity to replicate our reported numbers, so that the models with a sparsity above 0.94 will be saved.

After pruning the model, the same script could be used for further fine-tuning the pruned model with following arguments:

  • --pretrained_pruned_model: directory of the pruned model
  • --learning_rate: learning rate of the fine-tuning stage Note that during fine-tuning stage, pruning_type should be set to None.

An example for training (pruning) is as follows:

TASK=MNLI
SUFFIX=sparsity0.95
EX_CATE=CoFi
PRUNING_TYPE=structured_heads+structured_mlp+hidden+layer
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4
SPARSITY_EPSILON=0.01

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON

An example for fine_tuning after pruning is as follows:

PRUNED_MODEL_PATH=$proj_dir/$TASK/$EX_CATE/${TASK}_${SUFFIX}/best
PRUNING_TYPE=None # Setting the pruning type to be None for standard fine-tuning.
LEARNING_RATE=3e-5

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON [PRUNED_MODEL_PATH] $LEARNING_RATE

The training process will save the model with the best validation accuracy under $PRUNED_MODEL_PATH/best. And you can use the evaluation.py script for evaluation.

Evaluation

Our pruned models are served on Huggingface's model hub. You can use the script evalution.py to get the sparsity, inference time and development set results of a pruned model.

python evaluation.py [TASK] [MODEL_NAME_OR_DIR]

An example use of evaluating a sentence classification model is as follows:

python evaluation.py MNLI princeton-nlp/CoFi-MNLI-s95 

The expected output of the model is as follows:

Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
seconds/example: 0.010151

Hyperparameters

We use the following hyperparamters for training CoFiPruning:

GLUE (small) GLUE (large) SQuAD
Batch size 32 32 16
Pruning learning rate 2e-5 2e-5 3e-5
Fine-tuning learning rate 1e-5, 2e-5, 3e-5 1e-5, 2e-5, 3e-5 1e-5, 2e-5, 3e-5
Layer distill. alpha 0.9, 0.7, 0.5 0.9, 0.7, 0.5 0.9, 0.7, 0.5
Cross entropy distill. alpha 0.1, 0.3, 0.5 0.1, 0.3, 0.5 0.1, 0.3, 0.5
Pruning epochs 100 20 20
Pre-finetuning epochs 4 1 1
Sparsity warmup epochs 20 2 2
Finetuning epochs 20 20 20

GLUE (small) denotes the GLUE tasks with a relatively smaller size including CoLA, STS-B, MRPC and RTE and GLUE (large) denotes the rest of the GLUE tasks including SST-2, MNLI, QQP and QNLI. Note that hyperparameter search is essential for small-sized datasets but is less important for large-sized datasets.

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Mengzhou ([email protected]) and Zexuan ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you use CoFiPruning in your work:

@inproceedings{xia2022structured,
   title={Structured Pruning Learns Compact and Accurate Models},
   author={Xia, Mengzhou and Zhong, Zexuan and Chen, Danqi},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2022}
}

More Repositories

1

SWE-agent

SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It solves 12.29% of bugs in the SWE-bench evaluation set and takes just 1.5 minutes to run.
Python
10,387
star
2

tree-of-thought-llm

[NeurIPS 2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Python
4,170
star
3

SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
Python
3,238
star
4

SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?
Python
1,228
star
5

MeZO

[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
Python
975
star
6

PURE

[NAACL 2021] A Frustratingly Easy Approach for Entity and Relation Extraction https://arxiv.org/abs/2010.12812
Python
763
star
7

LM-BFF

[ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723
Python
710
star
8

DensePhrases

[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval Learns Passage Retrieval, Too https://arxiv.org/abs/2012.12624
Python
593
star
9

LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Python
439
star
10

ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
Python
380
star
11

AutoCompressors

[EMNLP 2023] Adapting Language Models to Compress Long Contexts
Python
227
star
12

LESS

Preprint: Less: Selecting Influential Data for Targeted Instruction Tuning
Jupyter Notebook
208
star
13

WebShop

[NeurIPS 2022] πŸ›’WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
Python
201
star
14

TRIME

[EMNLP 2022] Training Language Models with Memory Augmentation https://arxiv.org/abs/2205.12674
Python
185
star
15

intercode

[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
Python
168
star
16

OptiPrompt

[NAACL 2021] Factual Probing Is [MASK]: Learning vs. Learning to Recall https://arxiv.org/abs/2104.05240
Python
167
star
17

TransformerPrograms

[NeurIPS 2023] Learning Transformer Programs
Python
150
star
18

EntityQuestions

EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers https://arxiv.org/abs/2109.08535
Python
124
star
19

DinkyTrain

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration πŸšƒ
Python
108
star
20

CEPE

Preprint: Long-Context Language Modeling with Parallel Encodings
Python
99
star
21

QuRating

Selecting High-Quality Data for Training Language Models
Python
85
star
22

NLProofS

EMNLP 2022: Generating Natural Language Proofs with Verifier-Guided Search https://arxiv.org/abs/2205.12443
Python
79
star
23

LLMBar

[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
Python
74
star
24

MQuAKE

[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
Jupyter Notebook
73
star
25

MADE

EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering
Python
70
star
26

LM-Kernel-FT

A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
Python
68
star
27

USACO

Can Language Models Solve Olympiad Programming?
Python
66
star
28

calm-textgame

[EMNLP 2020] Keep CALM and Explore: Language Models for Action Generation in Text-based Games
Python
62
star
29

c-sts

[EMNLP 2023] C-STS: Conditional Semantic Textual Similarity
Python
59
star
30

DataMUX

[NeurIPS 2022] DataMUX: Data Multiplexing for Neural Networks
Jupyter Notebook
57
star
31

ShortcutGrammar

EMNLP 2022: Finding Dataset Shortcuts with Grammar Induction https://arxiv.org/abs/2210.11560
Jupyter Notebook
57
star
32

EvalConvQA

[ACL 2022] Ditch the Gold Standard: Re-evaluating Conversational Question Answering
Python
44
star
33

Collie

[ICLR 2024] COLLIE: Systematic Construction of Constrained Text Generation Tasks
Jupyter Notebook
44
star
34

MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
Python
36
star
35

rationale-robustness

NAACL 2022: Can Rationalization Improve Robustness? https://arxiv.org/abs/2204.11790
Python
26
star
36

InstructEval

Evaluation suite for the systematic evaluation of instruction selection methods.
Jupyter Notebook
23
star
37

LM-Science-Tutor

Python
22
star
38

WhatICLLearns

[ACL 2023 Findings] What In-Context Learning β€œLearns” In-Context: Disentangling Task Recognition and Task Learning
Python
21
star
39

Cognac

Repo for paper: Controllable Text Generation with Language Constraints
Python
19
star
40

PTP

Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073
Python
18
star
41

semsup

Semantic Supervision: Enabling Generalization over Output Spaces
Python
16
star
42

datamux-pretraining

MUX-PLMs: Pretraining LMs with Data Multiplexing
Python
14
star
43

corpus-poisoning

[EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156
Python
14
star
44

XTX

[ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration in Text Games
Python
13
star
45

SRL-NLC

Safe Reinforcement Learning with Natural Language Constraints
13
star
46

MultilingualAnalysis

Repository for the paper titled: "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer"
Python
13
star
47

blindfold-textgame

[NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents
Python
12
star
48

dyck-transformer

[ACL 2021] Self-Attention Networks Can Process Bounded Hierarchical Languages
Python
11
star
49

metric-wsd

NAACL'2021: Non-Parametric Few-Shot Learning for Word Sense Disambiguation
Python
10
star
50

align-mlm

Python
10
star
51

semsup-xc

SemSup-XC: Semantic Supervision for Extreme Classification
Jupyter Notebook
10
star
52

lwm

We develop world models that can be adapted with natural language. Intergrating these models into artificial agents allows humans to effectively control these agents through verbal communication.
Python
7
star
53

CARETS

Python
6
star
54

Heuristic-Core

The code accompanying the paper "The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models" - https://arxiv.org/abs/2403.03942
Python
5
star
55

SPARTAN

SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers
Python
5
star
56

attribute-tagging

[LaReL 2022] Towards an Enhanced, Faithful, and Adaptable Web Interaction Environment
Python
4
star
57

NegotiationToM

Code release for Improving Dialog Systems for Negotiation with Personality Modeling.
Python
3
star
58

MoQA

Python
3
star
59

il-scaling-in-games

Official code repo of "Scaling Laws for Imitation Learning in NetHack"
Python
3
star