PTQ4ViT: Post-Training Quantization Framework for Vision Transformers. We use the twin uniform quantization method to reduce the quantization error on the post-Softmax and post-GELU activation values, and a Hessian-guided metric to evaluate different scaling factors, which improves the calibration accuracy at a small cost. The quantized vision transformers (ViT, DeiT, and Swin) achieve near-lossless prediction accuracy (less than 0.5% drop at 8-bit quantization) on the ImageNet classification task. Please read the paper for details.
19/07/2022 Add discussion on Base PTQ, and provide more ablation study results.
Top-1 accuracy (%) with different numbers of calibration images (#ims):
Model | W8A8 #ims=32 | W6A6 #ims=32 | W8A8 #ims=128 | W6A6 #ims=128 |
---|---|---|---|---|
ViT-S/224/32 | 75.58 | 71.91 | 75.54 | 72.29 |
ViT-S/224 | 81.00 | 78.63 | 80.99 | 78.44 |
ViT-B/224 | 84.25 | 81.65 | 84.27 | 81.84 |
ViT-B/384 | 85.83 | 83.35 | 85.81 | 83.84 |
DeiT-S/224 | 79.47 | 76.28 | 79.41 | 76.51 |
DeiT-B/224 | 81.48 | 80.25 | 81.54 | 80.30 |
DeiT-B/384 | 82.97 | 81.55 | 83.01 | 81.67 |
Swin-T/224 | 81.25 | 80.47 | 81.27 | 80.30 |
Swin-S/224 | 83.11 | 82.38 | 83.15 | 82.38 |
Swin-B/224 | 85.15 | 84.01 | 85.17 | 84.15 |
Swin-B/384 | 86.39 | 85.39 | 86.36 | 85.45 |
Quantization time with different numbers of calibration images (#ims):
Model | Time #ims=32 | Time #ims=128 |
---|---|---|
ViT-S/224/32 | 2 min | 5 min |
ViT-S/224 | 3 min | 7 min |
ViT-B/224 | 4 min | 13 min |
ViT-B/384 | 12 min | 43 min |
DeiT-S/224 | 3 min | 7 min |
DeiT-B/224 | 4 min | 16 min |
DeiT-B/384 | 14 min | 52 min |
Swin-T/224 | 3 min | 9 min |
Swin-S/224 | 8 min | 17 min |
Swin-B/224 | 10 min | 23 min |
Swin-B/384 | 25 min | 69 min |
One of the goals of PTQ4ViT is to quantize a vision transformer quickly.
To reduce the quantization time, we pre-compute the output and the output gradient of each layer and evaluate the influence of scaling-factor candidates in batches.
As demonstrated in the second table, PTQ4ViT can quantize most vision transformers in several minutes using 32 calibration images.
Using 128 calibration images significantly increases the quantization time.
As the first table shows, the Top-1 accuracy varies only slightly with the number of calibration images, demonstrating that PTQ4ViT is not very sensitive to it.
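As a rough illustration of this idea (a sketch under our own assumptions, not the repository code), the snippet below reuses a pre-computed fp32 output and its gradient to score a batch of scaling-factor candidates with the Hessian-guided metric, i.e. the squared output change weighted by the squared output gradient:

```python
import torch

def pick_weight_scale(weight, inp, raw_out, grad_out, candidates):
    """Choose the best weight scaling factor for one linear layer.
    raw_out (the fp32 output) and grad_out (its gradient on the calibration
    set) are computed once and reused for every candidate, so the search
    only re-runs the cheap quantize/matmul part.
    """
    scores = []
    for delta in candidates:
        w_q = torch.clamp(torch.round(weight / delta), -128, 127) * delta
        quant_out = inp @ w_q.t()
        # Hessian-guided metric: squared output perturbation weighted by the
        # squared output gradient (a diagonal-Hessian approximation of the
        # change in task loss).
        scores.append(((quant_out - raw_out) ** 2 * grad_out ** 2).sum())
    return candidates[int(torch.stack(scores).argmin())]

# Toy usage: 32 calibration tokens through a 16x16 linear layer.
inp = torch.randn(32, 16)
weight = torch.randn(16, 16)
raw_out = inp @ weight.t()
grad_out = torch.randn_like(raw_out)   # in practice, from one backward pass
candidates = torch.linspace(0.5, 1.2, 8) * weight.abs().max() / 127  # demo range
best_delta = pick_weight_scale(weight, inp, raw_out, grad_out, candidates)
```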
Base PTQ is a simple quantization strategy and serves as a benchmark for our experiments.
Like PTQ4ViT, we quantize all weights and inputs of fully-connected layers (including the first projection layer and the last prediction layer), as well as all input matrices of matrix multiplication operations.
For fully-connected layers, we use layerwise scaling factors for the weights and the inputs.
To get the best scaling factors, we apply a linear grid search on the search space.
Following EasyQuant and Liu et al., we use the same hyper-parameter settings for the grid search.
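For contrast with the Hessian-guided metric above, a minimal sketch of the Base PTQ objective could look as follows (our own illustration; a cosine-distance objective in the spirit of EasyQuant is assumed here, since the exact metric is not spelled out above):

```python
import torch
import torch.nn.functional as F

def base_ptq_distance(weight, inp, raw_out, delta):
    """Score one scaling-factor candidate for Base PTQ: cosine distance
    between the raw fp32 output and the output produced with the weights
    quantized at scale `delta` (assumed metric, see the lead-in)."""
    w_q = torch.clamp(torch.round(weight / delta), -128, 127) * delta
    quant_out = inp @ w_q.t()
    return 1 - F.cosine_similarity(quant_out.flatten(), raw_out.flatten(), dim=0)

# The linear grid search simply evaluates this distance for uniformly spaced
# candidates around |W|_max / 127 and keeps the candidate with the smallest one.
```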
Note that Base PTQ adopts a parallel quantization paradigm, which makes it essentially different from sequential quantization paradigms such as EasyQuant. In sequential quantization, the input of the layer currently being quantized is generated with the weights and activations of all previous layers already quantized, whereas in parallel quantization it is simply the raw output of the previous layer.
In practice, we found that sequential quantization of vision transformers suffers from significant accuracy degradation on small calibration datasets, while parallel quantization is robust in this setting. Therefore, we choose parallel quantization for both Base PTQ and PTQ4ViT.
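The following generic PyTorch sketch (our own illustration, not the repository's code) shows what parallel quantization means in practice: every layer's calibration input is captured from the unquantized model in a single forward pass, instead of re-running the network after each layer is quantized as a sequential scheme would:

```python
import torch
import torch.nn as nn

def collect_parallel_calibration_inputs(model, calib_batch):
    """Capture each Linear layer's raw fp32 input on the calibration data.
    In parallel quantization these captured tensors calibrate every layer
    independently; a sequential scheme would instead re-run the network
    after quantizing each layer so later layers see quantized activations."""
    captured, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            def hook(mod, inputs, name=name):
                captured[name] = inputs[0].detach()
            handles.append(module.register_forward_pre_hook(hook))
    with torch.no_grad():
        model(calib_batch)
    for h in handles:
        h.remove()
    return captured

# Toy stand-in for one transformer MLP block.
block = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
calib_inputs = collect_parallel_calibration_inputs(block, torch.randn(8, 16))
```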
We provide more ablation studies on the hyper-parameters.
It is enough to set the number of quantization intervals
We randomly sample 32 calibration images and repeat the quantization of each model 20 times; we observe that the fluctuation in accuracy is not significant.
The mean/std of accuracies are: ViT-S/32
15/01/2022 Add saved models quantized with PTQ4ViT.
model | link |
---|---|
ViT-S/224/32 | |
ViT-S/224 | |
ViT-B/224 | |
ViT-B/384 | |
DeiT-S/224 | |
DeiT-B/224 | |
DeiT-B/384 | |
Swin-T/224 | |
Swin-S/224 | |
Swin-B/224 | |
Swin-B/384 | |
10/12/2021 Add utils/integer.py; you can now:
- convert a calibrated fp32 model into int8
- register pre-forward hooks in the model and fetch activations in int8. (We use uint8 to store the results of twin quantization; please refer to the paper for the bit layout. A rough sketch of the idea follows below.)
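As a rough illustration of that layout (our own simplification with made-up names, not the code in utils/integer.py), the sketch below quantizes post-Softmax values with two uniform ranges and uses the top bit of the uint8 code to record which range was used:

```python
import torch

def twin_uniform_quantize(x, delta_r1, delta_r2, bits=8):
    """Quantize non-negative activations with two uniform ranges.
    R1 uses a fine scale (delta_r1) for the many small values, R2 a coarse
    scale (delta_r2) for the few large ones; the top bit of the k-bit code
    records which range was used, so the result fits in one uint8."""
    levels = 2 ** (bits - 1)                  # codes available per range
    in_r1 = x <= (levels - 1) * delta_r1      # representable by R1?
    q1 = torch.clamp(torch.round(x / delta_r1), 0, levels - 1)
    q2 = torch.clamp(torch.round(x / delta_r2), 0, levels - 1)
    code = torch.where(in_r1, q1, q2 + levels)                  # MSB set -> R2
    dequant = torch.where(in_r1, q1 * delta_r1, q2 * delta_r2)
    return code.to(torch.uint8), dequant

# Post-Softmax attention maps are mostly tiny values with a few large ones.
attn = torch.softmax(torch.randn(4, 8), dim=-1)
code, approx = twin_uniform_quantize(attn, delta_r1=1e-3, delta_r2=1.0 / 128)
```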
Requirements:
- python>=3.5
- pytorch>=1.5
- matplotlib
- pandas
- timm
To run the example tests, you should put your ImageNet2012 dataset in the path /datasets/imagenet.
We use ViTImageNetLoaderGenerator in utils/datasets.py to initialize our DataLoader.
If your ImageNet dataset is stored elsewhere, you'll need to manually pass its root as an argument when instantiating a ViTImageNetLoaderGenerator.
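A hypothetical instantiation is sketched below; only the idea of passing your own dataset root is taken from the note above, and the actual constructor arguments should be checked in utils/datasets.py:

```python
from utils.datasets import ViTImageNetLoaderGenerator

# The argument list here is an assumption for illustration only; consult
# utils/datasets.py for the real signature before use.
loader_gen = ViTImageNetLoaderGenerator("/my/storage/imagenet")
```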
To test on all models with BasePTQ/PTQ4ViT, run
python example/test_all.py
To run ablation testing, run
python example/test_ablation.py
You can run the testing scripts with multiple GPUs. For example, calling
python example/test_all.py --multigpu --n_gpu 6
will use 6 GPUs to run the test.
(Coming soon)
Results of Base PTQ
model | original | w8a8 | w6a6 |
---|---|---|---|
ViT-S/224/32 | 75.99 | 73.61 | 60.144 |
ViT-S/224 | 81.39 | 80.468 | 70.244 |
ViT-B/224 | 84.54 | 83.896 | 75.668 |
ViT-B/384 | 86.00 | 85.352 | 46.886 |
DeiT-S/224 | 79.80 | 77.654 | 72.268 |
DeiT-B/224 | 81.80 | 80.946 | 78.786 |
DeiT-B/384 | 83.11 | 82.33 | 68.442 |
Swin-T/224 | 81.39 | 80.962 | 78.456 |
Swin-S/224 | 83.23 | 82.758 | 81.742 |
Swin-B/224 | 85.27 | 84.792 | 83.354 |
Swin-B/384 | 86.44 | 86.168 | 85.226 |
Results of PTQ4ViT
model | original | w8a8 | w6a6 |
---|---|---|---|
ViT-S/224/32 | 75.99 | 75.582 | 71.908 |
ViT-S/224 | 81.39 | 81.002 | 78.63 |
ViT-B/224 | 84.54 | 84.25 | 81.65 |
ViT-B/384 | 86.00 | 85.828 | 83.348 |
DeiT-S/224 | 79.80 | 79.474 | 76.282 |
DeiT-B/224 | 81.80 | 81.482 | 80.25 |
DeiT-B/384 | 83.11 | 82.974 | 81.55 |
Swin-T/224 | 81.39 | 81.246 | 80.47 |
Swin-S/224 | 83.23 | 83.106 | 82.38 |
Swin-B/224 | 85.27 | 85.146 | 84.012 |
Swin-B/384 | 86.44 | 86.394 | 85.388 |
- ViT-S/224 (original top-1 accuracy 81.39%)
| Hessian Guided | Softmax Twin | GELU Twin | W8A8 | W6A6 |
|---|---|---|---|---|
| | | | 80.47 | 70.24 |
| ✓ | | | 80.93 | 77.20 |
| ✓ | ✓ | | 81.11 | 78.57 |
| ✓ | | ✓ | 80.84 | 76.93 |
| | ✓ | ✓ | 79.25 | 74.07 |
| ✓ | ✓ | ✓ | 81.00 | 78.63 |
- ViT-B/224 (original top-1 accuracy 84.54%)
| Hessian Guided | Softmax Twin | GELU Twin | W8A8 | W6A6 |
|---|---|---|---|---|
| | | | 83.90 | 75.67 |
| ✓ | | | 83.97 | 79.90 |
| ✓ | ✓ | | 84.07 | 80.76 |
| ✓ | | ✓ | 84.10 | 80.82 |
| | ✓ | ✓ | 83.40 | 78.86 |
| ✓ | ✓ | ✓ | 84.25 | 81.65 |
- ViT-B/384 (original top-1 accuracy 86.00%)
| Hessian Guided | Softmax Twin | GELU Twin | W8A8 | W6A6 |
|---|---|---|---|---|
| | | | 85.35 | 46.89 |
| ✓ | | | 85.42 | 79.99 |
| ✓ | ✓ | | 85.67 | 82.01 |
| ✓ | | ✓ | 85.60 | 82.21 |
| | ✓ | ✓ | 84.35 | 80.86 |
| ✓ | ✓ | ✓ | 85.89 | 83.19 |
@article{PTQ4ViT_arixv2022,
title={PTQ4ViT: Post-Training Quantization Framework for Vision Transformers},
author={Zhihang Yuan and Chenhao Xue and Yiqi Chen and Qiang Wu and Guangyu Sun},
journal={arXiv preprint arXiv:2111.12293},
year={2022},
}