  • Language
    Python
  • License
    Apache License 2.0

Repository Details

Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".

QMoE

This repository contains the full code of the paper QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.

It is organized as follows:

  • datautils.py: utilities for dataset loading
  • gptq.py: robust batch-implementation of GPTQ
  • quant.py: quantization utilities
  • sub1.py: efficient inference of compressed models
  • sub1_cuda_kernel.cu: CUDA kernels
  • switch.py: the efficient QMoE compression framework
  • test.py: per-layer benchmarks and ideal compression rates

Dependencies

The project was developed with:

  • torch==2.0.0+cu117
  • transformers==4.28.0
  • datasets==2.10.1
  • CUDA 11.4 GPU drivers
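
Since the versions above are exact pins, it can help to compare them against the local environment before running anything. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and the pins are simply copied from the list above.

```python
from importlib import metadata

# Version pins copied from the dependency list in this README.
PINNED = {
    "torch": "2.0.0+cu117",
    "transformers": "4.28.0",
    "datasets": "2.10.1",
}


def installed_versions(names):
    """Look up the installed version of each package (None if it is missing)."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (installed_version, pinned_version)} for every mismatch."""
    return {
        name: (installed.get(name), want)
        for name, want in pinned.items()
        if installed.get(name) != want
    }
```

Calling `find_mismatches(installed_versions(PINNED))` returns an empty dict when the environment matches the list. A mismatch does not necessarily mean the code will fail; these are only the versions the project was developed with.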

CUDA kernels for compressed storage and inference can be installed via:

python setup_cuda.py install

Usage

The following sample commands run the different experiments.

# BF16 baseline eval on C4 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 
# BF16 baseline eval on additional datasets 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round to nearest baseline 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest 

# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000

# test kernels and compute ideal compression rates 
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark

# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128 
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
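
The ternary round-to-nearest baseline in the commands above maps every weight to one of three values. As a rough illustration of that idea only, here is a minimal sketch that assumes a per-row grid of {min, 0, max}. The function name and the grid choice are assumptions made for this example and are not taken from quant.py, which may differ.

```python
def ternary_round_to_nearest(row):
    """Map each weight in `row` to the closest value in {min(row), 0, max(row)}."""
    lo = min(min(row), 0.0)  # most negative weight, or 0 if none is negative
    hi = max(max(row), 0.0)  # most positive weight, or 0 if none is positive
    grid = (lo, 0.0, hi)
    # Plain round-to-nearest: pick the grid point with the smallest distance.
    return [min(grid, key=lambda g: abs(w - g)) for w in row]


print(ternary_round_to_nearest([-1.0, -0.4, 0.1, 0.6, 2.0]))
# prints [-1.0, 0.0, 0.0, 0.0, 2.0]
```

Round-to-nearest uses no calibration data, which is why it serves as the baseline to compare against the QMoE runs with --trainsamples.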

In general, you can pass google/switch-large-128 and google/switch-c-2048 to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out of the box due to Hugging Face bugs.

Always specify CUDA_VISIBLE_DEVICES since some commands, like sub1.py, will otherwise attempt to use all available GPUs.
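
When launching these commands from a Python script rather than a shell, the same rule applies: set CUDA_VISIBLE_DEVICES explicitly in the child environment. A minimal sketch follows; `build_run` is a hypothetical helper written for this example and is not part of the repository.

```python
import os
import sys


def build_run(script, *args, gpus="0", base_env=None):
    """Build (argv, env) for one experiment with CUDA_VISIBLE_DEVICES pinned.

    `gpus` is the comma-separated device list, e.g. "0" or "0,1".
    `base_env` defaults to a copy of the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = gpus
    argv = [sys.executable, script, *args]
    return argv, env


argv, env = build_run("sub1.py", "PATH_TO_COMP_MODEL", "--valsamples", "128")
# To actually launch it: subprocess.run(argv, env=env, check=True)
```

Returning the command and environment instead of running them keeps the helper easy to inspect before anything touches a GPU.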

Compressed Models

Our models in the custom compressed QMoE format are available on Hugging Face: base-128, large-128, and c-2048. To use one, clone its model repository and pass the local path to sub1.py.

Cite

If you found this work useful, please consider citing:

@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint, arxiv:2310.16795}
}
