  • Stars: 1,872
  • Rank: 24,603 (Top 0.5%)
  • Language: Python
  • License: Apache License 2.0
  • Created: almost 2 years ago
  • Updated: 6 months ago

Repository Details

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

GPTQ

This repository contains the code for the ICLR 2023 paper GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. The current release includes the following features:

  • An efficient implementation of the GPTQ algorithm: gptq.py (a simplified sketch of the core update is shown right after this list)
  • Compressing all models from the OPT and BLOOM families to 2/3/4 bits, including weight grouping: opt.py, bloom.py, zeroShot/
  • Evaluating the perplexity of quantized models on several language generation tasks: opt.py, bloom.py
  • Evaluating the performance of quantized models on several ZeroShot tasks: zeroShot/
  • A CUDA kernel for multiplying a 3-bit quantized matrix with a full-precision vector: quant_cuda_kernel.cu, quant_cuda.cpp, setup_cuda.py
  • Benchmarking code for individual matrix-vector products and for language generation with quantized models: test_kernel.py, opt.py
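
For orientation, here is a heavily simplified, hypothetical Python sketch of the core GPTQ update: quantize one weight column at a time and spread the rounding error over the not-yet-quantized columns via the inverse Hessian of the layer inputs. The optimized implementation in gptq.py uses further techniques (blocking, a Cholesky-based formulation) and differs in detail; the function and argument names below are illustrative only.

import torch

def gptq_quantize(W, H_inv, grid):
    # Hypothetical, heavily simplified sketch of the GPTQ update (see gptq.py
    # for the actual, optimized implementation). W: [rows, cols] weight matrix,
    # H_inv: inverse Hessian of the layer inputs, grid: 1D tensor of quantization levels.
    W = W.clone()
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        w = W[:, j]
        # round each weight in column j to the nearest grid point
        idx = torch.argmin((w[:, None] - grid[None, :]).abs(), dim=1)
        q = grid[idx]
        Q[:, j] = q
        # spread the scaled rounding error onto the not-yet-quantized columns
        err = (w - q) / H_inv[j, j]
        W[:, j + 1:] -= err[:, None] * H_inv[j, j + 1:][None, :]
    return Q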

New Features

Update July 2023:

  • Added the --static-groups option, which determines all group grids in advance rather than dynamically during quantization; as a result, --act-order does not require any inference changes (which could cause slowdowns) when used together with this option (a rough illustration follows below).
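
The following is a hypothetical Python sketch (not the repository's implementation) of what determining group grids up-front means: per-group quantization parameters are fixed from the original column order, so a later column permutation does not change which grid a given column uses.

import torch

def static_group_grids(W, group_size=128, bits=4):
    # Hypothetical illustration of the idea behind --static-groups (not the
    # repository's code): fix per-group quantization parameters up-front, in
    # the original column order, so that a later --act-order permutation does
    # not change the grid assigned to a given column.
    maxq = 2 ** bits - 1
    grids = {}
    for start in range(0, W.shape[1], group_size):
        group = W[:, start:start + group_size]
        wmin = group.min(dim=1, keepdim=True).values
        wmax = group.max(dim=1, keepdim=True).values
        scale = torch.clamp(wmax - wmin, min=1e-8) / maxq
        zero = torch.round(-wmin / scale)
        for col in range(start, min(start + group_size, W.shape[1])):
            grids[col] = (scale, zero)  # column index -> fixed (scale, zero)
    return grids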

Together with the camera-ready version of the paper, we have added several updates to this repository:

  • Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
  • Optimized 3bit kernels, which are considerably faster especially on the A100, e.g. 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via --faster-kernel.
  • A minimal LlaMa integration (for more complete features see the GPTQ-for-LLaMA repository), which demonstrates two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general; a rough sketch of the --act-order heuristic follows after this list.
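
As a rough illustration (a simplified sketch under assumed inputs, not the repository's exact code), --act-order boils down to choosing a column permutation from the diagonal of the input Hessian before running GPTQ, and undoing it afterwards:

import torch

def act_order_permutation(H):
    # Simplified sketch of the --act-order heuristic: process weight columns
    # in order of decreasing Hessian diagonal (a proxy for activation size).
    perm = torch.argsort(torch.diag(H), descending=True)
    inv_perm = torch.argsort(perm)  # maps permuted positions back to the original order
    return perm, inv_perm

# usage sketch: run GPTQ on W[:, perm], then restore the column order via Q[:, inv_perm]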

Here is a summary of LLaMa results:

Wiki2 PPL    FP16   4bit-RTN   4bit-GPTQ   3bit-RTN   3bit-GPTQ   3g128-GPTQ
LLaMa-7B     5.68   6.29       6.09        25.54      8.07        6.61
LLaMa-13B    5.09   5.53       5.36        11.40      6.63        5.62
LLaMa-30B    4.10   4.54       4.45        14.89      5.69        4.80
LLaMa-65B    3.53   3.92       3.84        10.59      5.04        4.17

Here is a sample command:

python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order --new-eval

The --act-order heuristic also dramatically improves accuracy on the OPT-66B outlier model: 9.55 to 9.34 and 14.16 to 9.95 PPL on Wiki2 for 4bit and 3bit, respectively.

Dependencies

  • torch: tested on v1.10.1+cu111
  • transformers: tested on v4.21.2 (the LlaMa integration currently requires installing transformers from the main branch, plus sentencepiece)
  • datasets: tested on v1.17.0
  • to run the 3-bit kernels: a setup for compiling PyTorch CUDA extensions (see also https://pytorch.org/tutorials/advanced/cpp_extension.html); tested on CUDA 11.4

All experiments were run on a single 80GB NVIDIA A100; however, most of them will also run on a GPU with considerably less memory.
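
As a quick, optional sanity check (a minimal snippet, not part of the repository), you can print the installed versions and confirm that CUDA is visible:

import torch
import transformers
import datasets

# compare against the tested versions listed above
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("CUDA available:", torch.cuda.is_available())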

Language Generation

OPT

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 [--groupsize 1024]

To run other OPT models, replace opt-125m with one of: opt-350m, opt-1.3b, opt-2.7b, opt-6.7b, opt-13b, opt-66b. For the 175B-parameter model, you have to request access from Meta and then convert it to a local HuggingFace checkpoint using their scripts in metaseq. Once you have such a checkpoint, simply pass its path instead of facebook/opt-125m.

BLOOM

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 [--groupsize 1024]

To run other BLOOM models replace bloom-560m with one of: bloom-1b1, bloom-1b7, bloom-3b, bloom-7b1, bloom.

ZeroShot

See zeroShot/ folder.

3-bit CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 3-bit OPT-175B:
# OPT175B denotes the name of the folder with the HuggingFace OPT-175b checkpoint (see above)

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --wbits 3 --save opt175b-3bit.pt
# Benchmark generating a 128 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --load opt175b-3bit.pt --benchmark 128
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python opt.py OPT175B c4 --benchmark 128

Please note that our 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
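
For intuition about the storage format such a kernel consumes, here is a rough, hypothetical illustration of packing 3-bit integers into 32-bit words; the actual layout used by quant_cuda_kernel.cu may differ.

import numpy as np

def pack_3bit(values):
    # Hypothetical illustration only; the real layout in quant_cuda_kernel.cu
    # may differ. 32 three-bit values occupy 96 bits, i.e. three 32-bit words.
    assert len(values) % 32 == 0
    words, buf, nbits = [], 0, 0
    for v in values:
        buf |= (int(v) & 0b111) << nbits
        nbits += 3
        if nbits >= 32:
            words.append(buf & 0xFFFFFFFF)
            buf >>= 32
            nbits -= 32
    return np.array(words, dtype=np.uint32)

# example: 32 values in [0, 7] pack into 3 words
print(pack_3bit(np.random.randint(0, 8, size=32)).shape)  # (3,)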

Cite

If you found this work useful, please consider citing:

@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}

More Repositories

  1. sparsegpt (Python, 694 stars): Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
  2. marlin (Python, 557 stars): FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
  3. qmoe (Python, 258 stars): Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
  4. PanzaMail (Python, 254 stars)
  5. QUIK (C++, 167 stars): Repository for the QUIK project, enabling the use of 4bit kernels for generative inference.
  6. OBC (Python, 95 stars): Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
  7. Sparse-Marlin (Cuda, 46 stars): Boosting 4-bit inference kernels with 2:4 sparsity.
  8. WoodFisher (Python, 45 stars): Code accompanying the NeurIPS 2020 paper WoodFisher (Singh & Alistarh, 2020).
  9. SparseFinetuning (Python, 36 stars): Repository for sparse finetuning of LLMs via a modified version of the MosaicML llmfoundry.
  10. RoSA (Python, 29 stars)
  11. QIGen (Python, 25 stars): Repository for CPU kernel generation for LLM inference.
  12. ACDC (Python, 20 stars): Code for reproducing "AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks" (NeurIPS 2021).
  13. spdy (Python, 18 stars): Code for the ICML 2022 paper "SPDY: Accurate Pruning with Speedup Guarantees".
  14. M-FAC (Python, 16 stars): Efficient reference implementations of the static & dynamic M-FAC algorithms (for pruning and optimization).
  15. torch_cgx (C++, 14 stars): PyTorch distributed backend extension with compression support.
  16. sparseprop (C++, 13 stars)
  17. peft-rosa (Python, 13 stars): A fork of the PEFT library, supporting Robust Adaptation (RoSA).
  18. sparse-imagenet-transfer (Python, 8 stars): Code for reproducing the results in "How Well do Sparse Imagenet Models Transfer?", presented at CVPR 2022.
  19. CrAM (Python, 8 stars): Code for reproducing the results from "CrAM: A Compression-Aware Minimizer", accepted at ICLR 2023.
  20. MicroAdam (Python, 8 stars): Code for the MicroAdam paper.
  21. spops (C++, 6 stars)
  22. ISTA-DASLab-Optimizers (Python, 5 stars)
  23. EFCP (Python, 4 stars): Code to reproduce the experiments from the paper "Error Feedback Can Accurately Compress Preconditioners".
  24. pruned-vision-model-bias (Jupyter Notebook, 4 stars): Code for reproducing the paper "Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures".
  25. Mathador-LM (Python, 4 stars): Code for the paper "Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on LLMs".
  26. CAP (Python, 4 stars): Repository for the Correlation Aware Prune (NeurIPS 2023) source and experimental code.
  27. evolution-strategies (Python, 2 stars)
  28. TACO4NLP (Python, 2 stars): Task-aware compression for various NLP tasks.
  29. smart-quantizer (Python, 1 star): Repository for Vitaly's implementation of the distribution-adaptive quantizer.
  30. ZipLM (1 star): Code for the NeurIPS 2023 paper "ZipLM: Inference-Aware Structured Pruning of Language Models".
  31. QRGD (MATLAB, 1 star): Repository for the implementation of "Distributed Principal Component Analysis with Limited Communication" (Alimisis et al., NeurIPS 2021). Parts of this code were originally based on code from "Communication-Efficient Distributed PCA by Riemannian Optimization" (Huang and Pan, ICML 2020).
  32. KDVR (Python, 1 star): Code for the experiments in "Knowledge Distillation Performs Partial Variance Reduction" (NeurIPS 2023).