[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

@inproceedings{hanruiwang2020hat,
    title     = {HAT: Hardware-Aware Transformers for Efficient Natural Language Processing},
    author    = {Wang, Hanrui and Wu, Zhanghao and Liu, Zhijian and Cai, Han and Zhu, Ligeng and Gan, Chuang and Han, Song},
    booktitle = {Annual Conference of the Association for Computational Linguistics},
    year      = {2020}
} 

Overview

We release the PyTorch code and 50 pre-trained models for HAT: Hardware-Aware Transformers. Within a Transformer supernet (SuperTransformer), we efficiently search for a specialized fast model (SubTransformer) for each hardware platform, using latency feedback. The search cost is reduced by over 10000×.

HAT Framework overview: [figure]

HAT models achieve up to 3× speedup and 3.7× smaller model size with no performance loss.

Usage

Installation

To install from source and develop locally:

git clone https://github.com/mit-han-lab/hardware-aware-transformers.git
cd hardware-aware-transformers
pip install --editable .

Data Preparation

| Task | task_name | Train | Valid | Test |
|---|---|---|---|---|
| WMT'14 En-De | wmt14.en-de | WMT'16 | newstest2013 | newstest2014 |
| WMT'14 En-Fr | wmt14.en-fr | WMT'14 | newstest2012&2013 | newstest2014 |
| WMT'19 En-De | wmt19.en-de | WMT'19 | newstest2017 | newstest2018 |
| IWSLT'14 De-En | iwslt14.de-en | IWSLT'14 train set | IWSLT'14 valid set | IWSLT14.TED.dev2010, IWSLT14.TEDX.dev2012, IWSLT14.TED.tst2010, IWSLT14.TED.tst2011, IWSLT14.TED.tst2012 |

To download and preprocess data, run:

bash configs/[task_name]/preprocess.sh

If you find preprocessing time-consuming, you can directly download the preprocessed data we provide:

bash configs/[task_name]/get_preprocessed.sh

Testing

We provide pre-trained models (SubTransformers) on the Machine Translation tasks for evaluation. The #Params and FLOPs figures exclude the embedding lookup table and the final output layer, because the size of both depends on the task.
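The counting convention above can be illustrated with a minimal sketch. It is not the repository's profiling code: the parameter names, the shapes, and the `excluded_keywords` filter are all hypothetical and only show how task-dependent layers might be left out of a parameter count.

```python
from math import prod


def count_params(shapes, excluded_keywords=("embed_tokens", "output_projection")):
    """Sum parameter counts, skipping task-dependent layers.

    `shapes` maps a parameter name to its shape tuple. The keyword filter
    is an assumption made for this example, not the repository's rule.
    """
    total = 0
    for name, shape in shapes.items():
        if any(keyword in name for keyword in excluded_keywords):
            continue  # embedding table or final output layer: not counted
        total += prod(shape)
    return total


# Hypothetical parameter shapes for a tiny model.
toy_shapes = {
    "encoder.embed_tokens.weight": (32000, 512),
    "encoder.layers.0.fc1.weight": (2048, 512),
    "encoder.layers.0.fc1.bias": (2048,),
    "decoder.output_projection.weight": (32000, 512),
}

counted = count_params(toy_shapes)
print(f"{counted / 1e6:.2f}M parameters counted")
```

Only the two `fc1` tensors contribute here, which is why the reported sizes are comparable across tasks with different vocabularies.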

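| Task | Hardware | Latency | #Params (M) | FLOPs (G) | BLEU | SacreBLEU | model_name | Link |
|---|---|---|---|---|---|---|---|---|
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 3.5s | 25.22 | 1.53 | 25.8 | 25.6 | [email protected][email protected] | link |
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 4.0s | 29.42 | 1.78 | 26.9 | 26.6 | [email protected][email protected] | link |
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 4.5s | 35.72 | 2.19 | 27.6 | 27.1 | [email protected][email protected] | link |
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 5.0s | 36.77 | 2.26 | 27.8 | 27.2 | [email protected][email protected] | link |
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 6.0s | 44.13 | 2.70 | 28.2 | 27.6 | [email protected][email protected] | link |
| WMT'14 En-De | Raspberry Pi ARM Cortex-A72 CPU | 6.9s | 48.33 | 3.02 | 28.4 | 27.8 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 137.9ms | 30.47 | 1.87 | 25.8 | 25.6 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 204.2ms | 35.72 | 2.19 | 27.6 | 27.1 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 278.7ms | 40.97 | 2.54 | 27.9 | 27.3 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 340.2ms | 46.23 | 2.86 | 28.1 | 27.5 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 369.6ms | 51.48 | 3.21 | 28.2 | 27.6 | [email protected][email protected] | link |
| WMT'14 En-De | Intel Xeon E5-2640 CPU | 450.9ms | 56.73 | 3.53 | 28.5 | 27.9 | [email protected][email protected] | link |
| WMT'14 En-De | Nvidia TITAN Xp GPU | 57.1ms | 30.47 | 1.87 | 25.8 | 25.6 | [email protected][email protected] | link |
| WMT'14 En-De | Nvidia TITAN Xp GPU | 91.2ms | 35.72 | 2.19 | 27.6 | 27.1 | [email protected][email protected] | link |
| WMT'14 En-De | Nvidia TITAN Xp GPU | 126.0ms | 40.97 | 2.54 | 27.9 | 27.3 | [email protected][email protected] | link |
| WMT'14 En-De | Nvidia TITAN Xp GPU | 146.7ms | 51.20 | 3.17 | 28.1 | 27.5 | [email protected][email protected] | link |
| WMT'14 En-De | Nvidia TITAN Xp GPU | 208.1ms | 49.38 | 3.09 | 28.5 | 27.8 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 4.3s | 25.22 | 1.53 | 38.8 | 36.0 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 5.3s | 35.72 | 2.23 | 40.1 | 37.3 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 5.8s | 36.77 | 2.26 | 40.6 | 37.8 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 6.9s | 44.13 | 2.70 | 41.1 | 38.3 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 7.8s | 49.38 | 3.09 | 41.4 | 38.5 | [email protected][email protected] | link |
| WMT'14 En-Fr | Raspberry Pi ARM Cortex-A72 CPU | 9.1s | 56.73 | 3.57 | 41.8 | 38.9 | [email protected][email protected] | link |
| WMT'14 En-Fr | Intel Xeon E5-2640 CPU | 154.7ms | 30.47 | 1.84 | 39.1 | 36.3 | [email protected][email protected] | link |
| WMT'14 En-Fr | Intel Xeon E5-2640 CPU | 208.8ms | 35.72 | 2.23 | 40.0 | 37.2 | [email protected][email protected] | link |
| WMT'14 En-Fr | Intel Xeon E5-2640 CPU | 329.4ms | 44.13 | 2.70 | 41.1 | 38.2 | [email protected][email protected] | link |
| WMT'14 En-Fr | Intel Xeon E5-2640 CPU | 394.5ms | 51.48 | 3.28 | 41.4 | 38.5 | [email protected][email protected] | link |
| WMT'14 En-Fr | Intel Xeon E5-2640 CPU | 442.0ms | 56.73 | 3.57 | 41.7 | 38.8 | [email protected][email protected] | link |
| WMT'14 En-Fr | Nvidia TITAN Xp GPU | 69.3ms | 30.47 | 1.84 | 39.1 | 36.3 | [email protected][email protected] | link |
| WMT'14 En-Fr | Nvidia TITAN Xp GPU | 94.9ms | 35.72 | 2.23 | 40.0 | 37.2 | [email protected][email protected] | link |
| WMT'14 En-Fr | Nvidia TITAN Xp GPU | 132.9ms | 40.97 | 2.51 | 40.7 | 37.8 | [email protected][email protected] | link |
| WMT'14 En-Fr | Nvidia TITAN Xp GPU | 168.3ms | 46.23 | 2.90 | 41.1 | 38.3 | [email protected][email protected] | link |
| WMT'14 En-Fr | Nvidia TITAN Xp GPU | 208.3ms | 51.48 | 3.25 | 41.7 | 38.8 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 55.7ms | 36.89 | 2.27 | 42.4 | 41.9 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 93.2ms | 42.28 | 2.63 | 44.4 | 43.9 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 134.5ms | 40.97 | 2.54 | 45.4 | 44.7 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 176.1ms | 46.23 | 2.86 | 46.2 | 45.6 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 204.5ms | 51.48 | 3.18 | 46.5 | 45.7 | [email protected][email protected] | link |
| WMT'19 En-De | Nvidia TITAN Xp GPU | 237.8ms | 56.73 | 3.53 | 46.7 | 46.0 | [email protected][email protected] | link |
| IWSLT'14 De-En | Nvidia TITAN Xp GPU | 45.6ms | 16.82 | 0.78 | 33.4 | 32.5 | [email protected][email protected] | link |
| IWSLT'14 De-En | Nvidia TITAN Xp GPU | 74.5ms | 19.98 | 0.93 | 34.2 | 33.3 | [email protected][email protected] | link |
| IWSLT'14 De-En | Nvidia TITAN Xp GPU | 109.0ms | 23.13 | 1.13 | 34.5 | 33.6 | [email protected][email protected] | link |
| IWSLT'14 De-En | Nvidia TITAN Xp GPU | 137.8ms | 27.33 | 1.32 | 34.7 | 33.8 | [email protected][email protected] | link |
| IWSLT'14 De-En | Nvidia TITAN Xp GPU | 168.8ms | 31.54 | 1.52 | 34.8 | 33.9 | [email protected][email protected] | link |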

Download models:

python download_model.py --model-name=[model_name]
# for example
python download_model.py [email protected][email protected]
# to download all models
python download_model.py --download-all

Test BLEU (SacreBLEU) score:

bash configs/[task_name]/test.sh \
    [model_file] \
    configs/[task_name]/subtransformer/[model_name].yml \
    [normal|sacre]
# for example
bash configs/wmt14.en-de/test.sh \
    ./downloaded_models/[email protected][email protected] \
    configs/wmt14.en-de/subtransformer/[email protected][email protected] \
    normal
# another example
bash configs/iwslt14.de-en/test.sh \
    ./downloaded_models/[email protected][email protected] \
    configs/iwslt14.de-en/subtransformer/[email protected][email protected] \
    sacre

Test Latency, model size and FLOPs

To profile the latency, model size and FLOPs (FLOPs profiling needs torchprofile), you can run the commands below. By default, only the model size is profiled:

python train.py \
    --configs=configs/[task_name]/subtransformer/[model_name].yml \
    --sub-configs=configs/[task_name]/subtransformer/common.yml \
    [--latgpu|--latcpu|--profile-flops]
# for example
python train.py \
    --configs=configs/wmt14.en-de/subtransformer/[email protected][email protected] \
    --sub-configs=configs/wmt14.en-de/subtransformer/common.yml --latcpu
# another example
python train.py \
    --configs=configs/iwslt14.de-en/subtransformer/[email protected][email protected] \
    --sub-configs=configs/iwslt14.de-en/subtransformer/common.yml --profile-flops
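For readers who want to see what a latency measurement of this kind usually involves, here is a minimal sketch. It is not what `--latcpu` or `--latgpu` run internally: the `measure_latency_ms` helper, the warm-up and repeat counts, and the toy workload are all assumptions chosen for illustration.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=3, repeats=10):
    """Time a callable: discard warm-up runs, then report mean and std in ms."""
    for _ in range(warmup):
        run_once()  # warm caches and lazy initialisation before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


def toy_workload():
    # Stand-in for one forward pass of a model (hypothetical).
    sum(i * i for i in range(50_000))


mean_ms, std_ms = measure_latency_ms(toy_workload)
print(f"latency: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```

On a GPU, the device would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()`), otherwise only the kernel launch time is measured.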

Training

1. Train a SuperTransformer

The SuperTransformer is a supernet that contains many SubTransformers with weight-sharing. By default, we train WMT tasks on 8 GPUs. Please adjust --update-freq according to the number of GPUs (128/x for x GPUs). Note that for IWSLT, we train on only one GPU with --update-freq=1.

python train.py --configs=configs/[task_name]/supertransformer/[search_space].yml
# for example
python train.py --configs=configs/wmt14.en-de/supertransformer/space0.yml
# another example
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --configs=configs/wmt14.en-fr/supertransformer/space0.yml --update-freq=32
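The 128/x rule for --update-freq on WMT tasks can be written as a small helper. This is only a sketch of the arithmetic stated above; the function name `update_freq_for` is hypothetical and not part of the repository.

```python
def update_freq_for(num_gpus: int, reference_product: int = 128) -> int:
    """Return the --update-freq value so that num_gpus * update_freq == 128."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if reference_product % num_gpus != 0:
        raise ValueError(f"{reference_product} is not divisible by {num_gpus} GPUs")
    return reference_product // num_gpus


for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): --update-freq={update_freq_for(gpus)}")
```

With 4 GPUs this gives 32, which matches the `--update-freq=32` used in the four-GPU example command.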

In the --configs file, SuperTransformer model architecture, SubTransformer search space and training settings are specified.

We also provide pre-trained SuperTransformers for the four tasks below. To download one, run python download_model.py --model-name=[model_name].

| Task | search_space | model_name | Link |
|---|---|---|---|
| WMT'14 En-De | space0 | HAT_wmt14ende_super_space0 | link |
| WMT'14 En-Fr | space0 | HAT_wmt14enfr_super_space0 | link |
| WMT'19 En-De | space0 | HAT_wmt19ende_super_space0 | link |
| IWSLT'14 De-En | space1 | HAT_iwslt14deen_super_space1 | link |

2. Evolutionary Search

The second step of HAT is to perform an evolutionary search in the trained SuperTransformer with a hardware latency constraint in the loop. We train a latency predictor to get fast and accurate latency feedback.

2.1 Generate a latency dataset
python latency_dataset.py --configs=configs/[task_name]/latency_dataset/[hardware_name].yml
# for example
python latency_dataset.py --configs=configs/wmt14.en-de/latency_dataset/cpu_raspberrypi.yml

hardware_name can be cpu_raspberrypi, cpu_xeon or gpu_titanxp. The --configs file contains the design space from which we sample models to get (model_architecture, real_latency) data pairs.

We provide the datasets we collected in the latency_dataset folder.
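The idea of collecting (model_architecture, real_latency) pairs can be sketched as follows. This is an illustration under stated assumptions, not the logic of latency_dataset.py: the `DESIGN_SPACE` keys and values are hypothetical, and `fake_latency_ms` stands in for a real on-device measurement.

```python
import random

# Hypothetical design space; the real one is defined in the --configs file.
DESIGN_SPACE = {
    "decoder_embed_dim": [512, 640],
    "decoder_layer_num": [1, 2, 3, 4, 5, 6],
    "decoder_ffn_embed_dim": [1024, 2048, 3072],
}


def sample_architecture(space, rng):
    """Pick one value per architectural dimension, uniformly at random."""
    return {name: rng.choice(choices) for name, choices in space.items()}


def fake_latency_ms(arch):
    """Stand-in for measuring the sampled model on the target hardware."""
    return (20.0 * arch["decoder_layer_num"]
            + arch["decoder_embed_dim"] / 64
            + arch["decoder_ffn_embed_dim"] / 512)


def build_latency_dataset(space, num_samples, measure, seed=0):
    """Return a list of (architecture, latency) pairs."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        arch = sample_architecture(space, rng)
        dataset.append((arch, measure(arch)))
    return dataset


for arch, latency in build_latency_dataset(DESIGN_SPACE, 3, fake_latency_ms):
    print(arch, f"{latency:.1f} ms")
```

In practice the measurement function would run the sampled SubTransformer on the actual device, since that is what makes the resulting predictor hardware-specific.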

2.2 Train a latency predictor

Then train a predictor with the collected dataset:

python latency_predictor.py --configs=configs/[task_name]/latency_predictor/[hardware_name].yml
# for example
python latency_predictor.py --configs=configs/wmt14.en-de/latency_predictor/cpu_raspberrypi.yml

The --configs file contains the predictor's model architecture and training settings. We provide pre-trained predictors in the latency_dataset/predictors folder.
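A predictor of this kind typically works on normalized inputs and outputs, which is why the normalization constants must stay consistent between training and use. The sketch below assumes that features are divided element-wise by a feature norm and latency by a latency norm. That convention, the constants, and the fixed linear `toy_predictor` are all assumptions for illustration, not the repository's implementation.

```python
# Hypothetical normalization constants (one divisor per feature, one for latency).
FEATURE_NORM = [640.0, 6.0, 3072.0]
LAT_NORM = 200.0  # milliseconds


def normalize_features(raw_features, feature_norm):
    """Scale each raw architecture feature by its divisor."""
    if len(raw_features) != len(feature_norm):
        raise ValueError("feature and norm lengths differ")
    return [value / norm for value, norm in zip(raw_features, feature_norm)]


def toy_predictor(normalized_features):
    """Stand-in for a trained predictor: a fixed linear model."""
    weights = [0.2, 0.5, 0.3]
    return sum(w * x for w, x in zip(weights, normalized_features))


def predict_latency_ms(raw_features, feature_norm, lat_norm):
    """Normalize inputs, predict, then map the output back to milliseconds."""
    normalized_latency = toy_predictor(normalize_features(raw_features, feature_norm))
    return normalized_latency * lat_norm


arch_features = [512, 3, 2048]  # embed dim, decoder layers, FFN dim (hypothetical)
consistent = predict_latency_ms(arch_features, FEATURE_NORM, LAT_NORM)
mismatched = predict_latency_ms(arch_features, FEATURE_NORM, lat_norm=1.0)
print(f"consistent norms: {consistent:.1f} ms, mismatched lat norm: {mismatched:.2f} ms")
```

Using a different latency norm at prediction time rescales every estimate, so any latency constraint compared against it would be applied on the wrong scale.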

2.3 Run evolutionary search with a latency constraint
python evo_search.py --configs=[supertransformer_config_file].yml --evo-configs=[evo_settings].yml
# for example
python evo_search.py --configs=configs/wmt14.en-de/supertransformer/space0.yml --evo-configs=configs/wmt14.en-de/evo_search/wmt14ende_titanxp.yml

The --configs file points to the SuperTransformer training config file. The --evo-configs file contains the evolutionary search settings and also specifies the desired latency constraint (latency-constraint). Note that the feature-norm and lat-norm values here must be the same as those used when training the latency predictor. --write-config-path specifies where to write out the searched SubTransformer architecture.
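An evolutionary search with a latency constraint in the loop can be sketched in a few lines. This is a toy under stated assumptions, not evo_search.py: the search space, `predicted_latency_ms`, the `fitness` proxy (higher is better) and all hyperparameters are hypothetical. In the real pipeline the latency comes from the trained predictor and candidates are scored using the trained SuperTransformer.

```python
import random

SPACE = {  # hypothetical search space
    "embed_dim": [512, 640],
    "layers": [1, 2, 3, 4, 5, 6],
    "ffn_dim": [1024, 2048, 3072],
}


def predicted_latency_ms(arch):
    """Stand-in for the latency predictor."""
    return 20.0 * arch["layers"] + arch["embed_dim"] / 64 + arch["ffn_dim"] / 512


def fitness(arch):
    """Toy quality proxy: larger models score higher."""
    return arch["layers"] + arch["embed_dim"] / 640 + arch["ffn_dim"] / 3072


def random_arch(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}


def mutate(arch, rng, prob=0.3):
    """Resample each gene with probability `prob`."""
    return {k: rng.choice(SPACE[k]) if rng.random() < prob else v
            for k, v in arch.items()}


def crossover(a, b, rng):
    """Take each gene from one of the two parents at random."""
    return {k: rng.choice((a[k], b[k])) for k in a}


def evo_search(latency_constraint, iterations=20, population_size=30,
               parent_size=10, seed=0):
    rng = random.Random(seed)

    def sample_valid(make):
        # Reject candidates that violate the latency constraint.
        while True:
            candidate = make()
            if predicted_latency_ms(candidate) <= latency_constraint:
                return candidate

    population = [sample_valid(lambda: random_arch(rng))
                  for _ in range(population_size)]
    for _ in range(iterations):
        population.sort(key=fitness, reverse=True)
        parents = population[:parent_size]
        children = []
        while len(children) < population_size - parent_size:
            if rng.random() < 0.5:
                children.append(sample_valid(lambda: mutate(rng.choice(parents), rng)))
            else:
                a, b = rng.sample(parents, 2)
                children.append(sample_valid(lambda: crossover(a, b, rng)))
        population = parents + children
    return max(population, key=fitness)


best = evo_search(latency_constraint=100.0)
print(best, f"{predicted_latency_ms(best):.1f} ms")
```

Rejecting invalid candidates at sampling time keeps every member of the population within the latency budget, so the final pick always satisfies the constraint. The toy constraint of 100 ms is feasible in this space, which is what guarantees the rejection loop terminates.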

3. Train a Searched SubTransformer

Finally, we train the searched SubTransformer from scratch:

python train.py --configs=[subtransformer_architecture].yml --sub-configs=configs/[task_name]/subtransformer/common.yml
# for example
python train.py --configs=configs/wmt14.en-de/subtransformer/[email protected] --sub-configs=configs/wmt14.en-de/subtransformer/common.yml

--configs points to the --write-config-path in step 2.3. --sub-configs contains training settings for the SubTransformer.

After training a SubTransformer, you can test its performance with the methods in the Testing section.

Dependencies

  • Python >= 3.6
  • PyTorch >= 1.0.0
  • configargparse >= 0.14
  • New model training requires NVIDIA GPUs and NCCL

Related works on efficient deep learning

MicroNet for Efficient Language Modeling

Lite Transformer with Long-Short Range Attention

AMC: AutoML for Model Compression and Acceleration on Mobile Devices

Once-for-All: Train One Network and Specialize it for Efficient Deployment

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

Contact

If you have any questions, feel free to contact Hanrui Wang by email ([email protected]) or through GitHub issues. Pull requests are very welcome!

Licence

This repository is released under the MIT license. See LICENSE for more information.

Acknowledgements

We are thankful to fairseq as the backbone of this repo.
