  • Stars: 2,957
  • Rank: 15,326 (Top 0.4%)
  • Language: Python
  • License: Apache License 2.0
  • Created: almost 2 years ago
  • Updated: 5 months ago

Repository Details

GPTQ-for-LLaMA

4-bit quantization of LLaMA using GPTQ

GPTQ is a SOTA one-shot weight quantization method.

The Triton kernel used here can be applied universally, but it is not the fastest option, and it only supports Linux.

Triton only supports Linux, so if you are a Windows user, please use WSL2.
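
To make the Bits and group-size columns in the results below concrete, here is a minimal, illustrative sketch of plain round-to-nearest (RTN) group-wise 4-bit quantization in PyTorch. This is not this repository's code: GPTQ goes further by quantizing weights column by column and correcting the remaining weights with second-order (Hessian) information to minimize each layer's output error, which is why GPTQ beats RTN in the tables below.

import torch

def rtn_groupwise_quant(w: torch.Tensor, bits: int = 4, groupsize: int = 128):
    """Illustrative asymmetric round-to-nearest quantization.
    Every group of `groupsize` weights shares one fp16 scale and one zero-point;
    the weights themselves are stored as `bits`-bit integers."""
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, groupsize)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax   # one scale per group
    zero = torch.round(-w_min / scale)               # one zero-point per group
    q = torch.clamp(torch.round(groups / scale + zero), 0, qmax)
    dequant = ((q - zero) * scale).reshape(w.shape)  # what the model computes with
    return q.to(torch.uint8), scale.half(), zero.to(torch.uint8), dequant

w = torch.randn(4096, 4096)
q, scale, zero, w_hat = rtn_groupwise_quant(w)
print("mean absolute quantization error:", (w - w_hat).abs().mean().item())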

News or Update

AutoGPTQ-triton, a packaged version of GPTQ with Triton, has been integrated into AutoGPTQ.

Results

| LLaMA-7B | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
|----------|------|------------|--------------|-----------------|----------------------|
| FP16     | 16   | -          | 13940        | 5.68            | 12.5                 |
| RTN      | 4    | -          | -            | 6.29            | -                    |
| GPTQ     | 4    | -          | 4740         | 6.09            | 3.5                  |
| GPTQ     | 4    | 128        | 4891         | 5.85            | 3.6                  |
| RTN      | 3    | -          | -            | 25.54           | -                    |
| GPTQ     | 3    | -          | 3852         | 8.07            | 2.7                  |
| GPTQ     | 3    | 128        | 4116         | 6.61            | 3.0                  |

| LLaMA-13B | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------------|----------------------|
| FP16      | 16   | -          | OOM          | 5.09            | 24.2                 |
| RTN       | 4    | -          | -            | 5.53            | -                    |
| GPTQ      | 4    | -          | 8410         | 5.36            | 6.5                  |
| GPTQ      | 4    | 128        | 8747         | 5.20            | 6.7                  |
| RTN       | 3    | -          | -            | 11.40           | -                    |
| GPTQ      | 3    | -          | 6870         | 6.63            | 5.1                  |
| GPTQ      | 3    | 128        | 7277         | 5.62            | 5.4                  |

| LLaMA-33B | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------------|----------------------|
| FP16      | 16   | -          | OOM          | 4.10            | 60.5                 |
| RTN       | 4    | -          | -            | 4.54            | -                    |
| GPTQ      | 4    | -          | 19493        | 4.45            | 15.7                 |
| GPTQ      | 4    | 128        | 20570        | 4.23            | 16.3                 |
| RTN       | 3    | -          | -            | 14.89           | -                    |
| GPTQ      | 3    | -          | 15493        | 5.69            | 12.0                 |
| GPTQ      | 3    | 128        | 16566        | 4.80            | 13.0                 |

| LLaMA-65B | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------------|----------------------|
| FP16      | 16   | -          | OOM          | 3.53            | 121.0                |
| RTN       | 4    | -          | -            | 3.92            | -                    |
| GPTQ      | 4    | -          | OOM          | 3.84            | 31.1                 |
| GPTQ      | 4    | 128        | OOM          | 3.65            | 32.3                 |
| RTN      | 3    | -          | -            | 10.59           | -                    |
| GPTQ      | 3    | -          | OOM          | 5.04            | 23.6                 |
| GPTQ      | 3    | 128        | OOM          | 4.17            | 25.6                 |

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (see IST-DASLab/gptq#1).

According to the GPTQ paper, the performance gap between FP16 and GPTQ decreases as the model size increases.

GPTQ vs bitsandbytes

| LLaMA-7B (seqlen=2048) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|------------------------|-----------------------|--------------|----------|
| FP16                   | 16                    | 13948        | 5.22     |
| GPTQ-128g              | 4.15                  | 4781         | 5.30     |
| nf4-double_quant       | 4.127                 | 4804         | 5.30     |
| nf4                    | 4.5                   | 5102         | 5.30     |
| fp4                    | 4.5                   | 5102         | 5.33     |

| LLaMA-13B (seqlen=2048) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|-------------------------|-----------------------|--------------|----------|
| FP16                    | 16                    | OOM          | -        |
| GPTQ-128g               | 4.15                  | 8589         | 5.02     |
| nf4-double_quant        | 4.127                 | 8581         | 5.04     |
| nf4                     | 4.5                   | 9170         | 5.04     |
| fp4                     | 4.5                   | 9170         | 5.11     |

| LLaMA-33B (seqlen=1024) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|-------------------------|-----------------------|--------------|----------|
| FP16                    | 16                    | OOM          | -        |
| GPTQ-128g               | 4.15                  | 18441        | 3.71     |
| nf4-double_quant        | 4.127                 | 18313        | 3.76     |
| nf4                     | 4.5                   | 19729        | 3.75     |
| fp4                     | 4.5                   | 19729        | 3.75     |
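
The Bits Per Weight (BPW) column can be roughly reconstructed from the per-group metadata each format stores. The sketch below is only a back-of-the-envelope estimate under assumed metadata layouts (a 16-bit scale plus a 4-bit zero-point per 128-weight group for GPTQ-128g; a 32-bit absmax per 64-weight block for nf4/fp4; an 8-bit requantized absmax plus a 32-bit secondary scale per 256 blocks for double quantization); the exact accounting depends on implementation details.

# Back-of-the-envelope BPW estimates. The metadata layouts are assumptions,
# not an exact description of either library's storage format.

def bpw_gptq(bits=4, groupsize=128, scale_bits=16, zero_bits=4):
    # each group of `groupsize` weights shares one scale and one zero-point
    return bits + (scale_bits + zero_bits) / groupsize

def bpw_nf4(blocksize=64, absmax_bits=32):
    # nf4/fp4 store one absmax per block of weights
    return 4 + absmax_bits / blocksize

def bpw_nf4_double_quant(blocksize=64, absmax_bits=8, dq_blocksize=256, dq_scale_bits=32):
    # double quantization re-quantizes the absmax values themselves
    return 4 + absmax_bits / blocksize + dq_scale_bits / (blocksize * dq_blocksize)

print(f"GPTQ-128g        ~{bpw_gptq():.3f} BPW")              # ~4.156 (table: 4.15)
print(f"nf4 / fp4        ~{bpw_nf4():.3f} BPW")               # 4.500 (table: 4.5)
print(f"nf4-double_quant ~{bpw_nf4_double_quant():.3f} BPW")  # ~4.127 (table: 4.127)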

Installation

If you don't have conda, install it first.

conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt

Dependencies

  • torch: tested on v2.0.0+cu117
  • transformers: tested on v4.28.0.dev0
  • datasets: tested on v2.10.1
  • safetensors: tested on v0.3.0

All experiments were run on a single NVIDIA RTX3090.
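
A quick sanity check after installation (a minimal sketch; your versions only need to be reasonably close to the tested ones listed above):

import torch, transformers, datasets, safetensors

print("torch        :", torch.__version__)         # tested on 2.0.0+cu117
print("transformers :", transformers.__version__)  # tested on 4.28.0.dev0
print("datasets     :", datasets.__version__)      # tested on 2.10.1
print("safetensors  :", safetensors.__version__)   # tested on 0.3.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))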

Language Generation

LLaMA

# Convert LLaMA weights to Hugging Face format
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt

# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors

# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check

# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"

# Model inference with the saved model, with safetensors loaded directly to the GPU
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0

# Model inference with the saved model using CPU offload (this is very slow)
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16

With LLaMA-65B on a single RTX3090 and pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (5 -> 50 tokens).

In general, 4-bit quantization with a groupsize of 128 is recommended.

You can also export the quantization parameters in toml+numpy format.

CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --quant-directory ${TOML_DIR}
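
If you would rather call the quantized model from your own Python code instead of the CLI scripts above, the rough shape is sketched below. This is a hedged sketch, not the repository's documented API: it assumes the load_quant helper defined in llama_inference.py with the argument order (model, checkpoint, wbits, groupsize), so check the script in your checkout, since the signature has changed between versions. The paths and sampling settings are illustrative.

# Sketch only: mirrors what llama_inference.py does internally.
import torch
from transformers import AutoTokenizer
from llama_inference import load_quant    # assumes you run this from the repository root

MODEL_DIR = "./llama-hf"                          # hypothetical path to the HF-format model
CHECKPOINT = "llama7b-4bit-128g.safetensors"      # produced by llama.py --save_safetensors

model = load_quant(MODEL_DIR, CHECKPOINT, 4, 128) # (model, checkpoint, wbits, groupsize)
model.to("cuda").eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
inputs = tokenizer("this is llama", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))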

Acknowledgements

This code is based on GPTQ.

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

The Triton GPTQ kernel code is based on GPTQ-triton.

More Repositories

1. lama-with-maskdino - automatic image inpainting (lama (with refinement) and maskdino). Python, 35 stars.
2. GPTQ-for-KoAlpaca - Python, 14 stars.
3. SoftPool - unofficial SoftPool implementation (Refining activation downsampling with SoftPool, https://arxiv.org/pdf/2101.00440v2.pdf). Python, 14 stars.
4. MaxVIT-pytorch - unofficial MaxViT implementation (MaxViT: Multi-Axis Vision Transformer, https://arxiv.org/abs/2204.01697). Python, 9 stars.
5. llama-danbooru-qlora - Jupyter Notebook, 8 stars.
6. AutoQuant - Python, 8 stars.
7. NatIR - NatIR: Image Restoration Using Neighborhood-Attention-Transformer. Python, 6 stars.
8. Neighborhood-Attention-Transformer - unofficial NAT implementation (Neighborhood Attention Transformer, https://arxiv.org/pdf/2204.07143.pdf). Python, 6 stars.
9. pale-transformer-pytorch - unofficial Pale Transformer implementation (Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, https://arxiv.org/abs/2112.14000). Python, 5 stars.
10. KoLIMA - Jupyter Notebook, 4 stars.
11. Magneto-pytorch - unofficial Magneto implementation (Foundation Transformers, https://arxiv.org/abs/2210.06423). Python, 3 stars.
12. MLP-Mixer-tf2 - unofficial MLP-Mixer implementation (MLP-Mixer: An all-MLP Architecture for Vision, https://arxiv.org/pdf/2105.01601v1.pdf). Python, 3 stars.
13. Swin-MLP-Mixer - a structure that combines Swin Transformer and MLP-Mixer; the model has not been trained or tested, so performance may be poor. Python, 3 stars.
14. D-Adaptation-Adan - Adan with D-Adaptation automatic step sizes. Python, 2 stars.
15. halonet-tf2 - unofficial HaloNet implementation (Scaling Local Self-Attention for Parameter Efficient Visual Backbones, https://arxiv.org/pdf/2103.12731v2.pdf). Python, 1 star.
16. psnr_ssim_ycbcr - code for calculating PSNR and SSIM on the Y channel in YCbCr; based on BasicSR (https://github.com/xinntao/BasicSR). Python, 1 star.
17. Subtitles-generator-with-whisper - subtitles generator using Whisper and a translator. Python, 1 star.