EfficientFormerV2 [ICCV 2023] & EfficientFormer [NeurIPS 2022]

EfficientFormerV2
Rethinking Vision Transformers for MobileNet Size and Speed

arXiv | PDF


Models are trained on ImageNet-1K and deployed on iPhone 12 with CoreMLTools to get latency.

Rethinking Vision Transformers for MobileNet Size and Speed
Yanyu Li1,2, Ju Hu1, Yang Wen1, Georgios Evangelidis1, Kamyar Salahi3,
Yanzhi Wang2, Sergey Tulyakov1, Jian Ren1
1Snap Inc., 2Northeastern University, 3UC Berkeley

Abstract With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even true for the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constraint hardware. In this work, we investigate a central question, can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose an improved supernet with low latency and high parameter efficiency. We further introduce a fine-grained joint search strategy that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve about 4% higher top-1 accuracy than MobileNetV2 and MobileNetV2x1.4 on ImageNet-1K with similar latency and parameters. We demonstrate that properly designed and optimized vision transformers can achieve high performance with MobileNet-level size and speed.

Changelog and ToDos

  • Add EfficientFormerV2 model family, including efficientformerv2_s0, efficientformerv2_s1, efficientformerv2_s2 and efficientformerv2_l.
  • Pretrained checkpoints of EfficientFormerV2 on ImageNet-1K are released.
  • Update EfficientFormerV2 in downstream tasks (detection, segmentation).
  • Release checkpoints in downstream tasks.
  • Add extra tools for profiling and deployment (we use CoreMLTools==5.2 and Torch==1.11). Example usage:
python toolbox.py --model efficientformerv2_l --ckpt weights/eformer_l_450.pth --onnx --coreml

EfficientFormer
Vision Transformers at MobileNet Speed

arXiv | PDF


Models are trained on ImageNet-1K and deployed on iPhone 12 with CoreMLTools to get latency.

EfficientFormer: Vision Transformers at MobileNet Speed
Yanyu Li1,2, Geng Yuan1,2, Yang Wen1, Eric Hu1, Georgios Evangelidis1,
Sergey Tulyakov1, Yanzhi Wang2, Jian Ren1
1Snap Inc., 2Northeastern University

Abstract Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2x1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

Classification on ImageNet-1K

Models

Model | Top-1 (300/450 epochs) | #params | MACs | Latency | ckpt | ONNX | CoreML
EfficientFormerV2-S0 | 75.7 / 76.2 | 3.5M | 0.40B | 0.9ms | S0 | S0 | S0
EfficientFormerV2-S1 | 79.0 / 79.7 | 6.1M | 0.65B | 1.1ms | S1 | S1 | S1
EfficientFormerV2-S2 | 81.6 / 82.0 | 12.6M | 1.25B | 1.6ms | S2 | S2 | S2
EfficientFormerV2-L | 83.3 / 83.5 | 26.1M | 2.56B | 2.7ms | L | L | L

Model | Top-1 Acc. | Latency | PyTorch Checkpoint | CoreML | ONNX
EfficientFormer-L1 | 79.2 (80.2) | 1.6ms | L1-300 (L1-1000) | L1 | L1
EfficientFormer-L3 | 82.4 | 3.0ms | L3 | L3 | L3
EfficientFormer-L7 | 83.3 | 7.0ms | L7 | L7 | L7
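
When choosing among these models for a device, the usual question is which one gives the best accuracy within a latency budget. The sketch below is one way to answer that from the figures listed above. It uses the first Top-1 number in each row and the iPhone 12 latency; `ModelEntry` and `best_under_budget` are hypothetical helper names, not part of this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelEntry:
    name: str
    top1: float        # ImageNet-1K top-1 accuracy (%), first figure in each row
    latency_ms: float  # iPhone 12 latency as listed in the tables


# Figures copied from the two tables above.
MODELS = [
    ModelEntry("EfficientFormerV2-S0", 75.7, 0.9),
    ModelEntry("EfficientFormerV2-S1", 79.0, 1.1),
    ModelEntry("EfficientFormerV2-S2", 81.6, 1.6),
    ModelEntry("EfficientFormerV2-L", 83.3, 2.7),
    ModelEntry("EfficientFormer-L1", 79.2, 1.6),
    ModelEntry("EfficientFormer-L3", 82.4, 3.0),
    ModelEntry("EfficientFormer-L7", 83.3, 7.0),
]


def best_under_budget(budget_ms: float) -> Optional[ModelEntry]:
    """Return the most accurate model whose latency fits the budget, or None."""
    candidates = [m for m in MODELS if m.latency_ms <= budget_ms]
    if not candidates:
        return None
    # Prefer higher accuracy; break ties in favor of lower latency.
    return max(candidates, key=lambda m: (m.top1, -m.latency_ms))


if __name__ == "__main__":
    for budget in (1.0, 2.0, 3.0):
        choice = best_under_budget(budget)
        print(budget, choice.name if choice else "no model fits")
```

The tie-break matters at larger budgets: EfficientFormerV2-L and EfficientFormer-L7 list the same top-1 accuracy, so the lower-latency V2 model is preferred.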

Latency Measurement

The latency reported for EfficientFormerV2 on iPhone 12 (iOS 16) is measured with the benchmark tool from Xcode 14.

For EfficientFormer (V1), we use coreml-performance. Thanks for the well-implemented latency measurement!

Tip: macOS with Xcode and a mobile device (iPhone 12) are needed to reproduce the reported speed.
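
The reported numbers come from on-device tooling, which cannot be reproduced in a few lines of code. As a rough host-side sanity check only, a generic timing harness might look like the sketch below. `median_latency_ms` and the stand-in workload are hypothetical, and results from it are not comparable to the iPhone 12 latency reported above.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 10, runs: int = 50) -> float:
    """Time a zero-argument callable and return the median latency in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost does not skew the result.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional scheduling spikes than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs one forward pass of a model.
    print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```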

ImageNet

Prerequisites

A conda virtual environment is recommended.

conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
pip install timm
pip install submitit
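
After installation, a quick check that the packages can be found may save a failed training launch. The following is a minimal sketch, assuming the import names `torch`, `torchvision`, `timm` and `submitit`; `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names for the packages installed above (pytorch is imported as "torch").
REQUIRED = ("torch", "torchvision", "timm", "submitit")


def missing_packages(names=REQUIRED):
    """Return the subset of top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```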

Data preparation

Download and extract ImageNet train and val images from http://image-net.org/. The training and validation data are expected to be in the train folder and val folder respectively:

|-- /path/to/imagenet/
    |-- train
    |-- val
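
The expected layout can be verified before launching a job. The sketch below only checks that the `train` and `val` folders exist and are not empty; `check_imagenet_layout` is a hypothetical helper, and it makes no assumption about what the folders contain.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return a list of problems found under the dataset root (empty if none)."""
    root = Path(root)
    problems = []
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
        elif not any(split_dir.iterdir()):
            problems.append(f"empty directory: {split_dir}")
    return problems


if __name__ == "__main__":
    # Replace with your own dataset location.
    for problem in check_imagenet_layout("/path/to/imagenet"):
        print(problem)
```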

Single machine multi-GPU training

We provide an example training script dist_train.sh using PyTorch distributed data parallel (DDP).

To train EfficientFormer-L1 on an 8-GPU machine:

sh dist_train.sh efficientformer_l1 8

Tip: specify your data path and experiment name in the script!

Multi-node training

On a Slurm-managed cluster, multi-node training can be launched through submitit, for example,

sh slurm_train.sh efficientformer_l1

Tip: specify GPUs/CPUs/memory per node in the script based on your resources!

Testing

We provide an example test script dist_test.sh using PyTorch distributed data parallel (DDP). For example, to test EfficientFormer-L1 on an 8-GPU machine:

sh dist_test.sh efficientformer_l1 8 weights/efficientformer_l1_300d.pth

Using EfficientFormer as backbone

Object Detection and Instance Segmentation
Semantic Segmentation

Acknowledgement

Classification (ImageNet) code base is partly built with LeViT and PoolFormer.

The detection and segmentation pipeline is from MMCV (MMDetection and MMSegmentation).

Thanks for the great implementations!

Citation

If our code or models help your work, please cite EfficientFormer (NeurIPS 2022) and EfficientFormerV2 (ICCV 2023):

@article{li2022efficientformer,
  title={Efficientformer: Vision transformers at mobilenet speed},
  author={Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Ju and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={12934--12949},
  year={2022}
}
@inproceedings{li2022rethinking,
  title={Rethinking Vision Transformers for MobileNet Size and Speed},
  author={Li, Yanyu and Hu, Ju and Wen, Yang and Evangelidis, Georgios and Salahi, Kamyar and Wang, Yanzhi and Tulyakov, Sergey and Ren, Jian},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  year={2023}
}
