Next-ViT

This repo is the official implementation of "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios". This algorithm is proposed by ByteDance, Intelligent Creation, AutoML Team (字节跳动-智能创作 AutoML团队).

Updates

08/16/2022

  1. Pretrained models on a large-scale dataset, following [SSLD], are provided.
  2. Segmentation results with the large-scale-dataset pretrained models are also presented.

Overview

Figure 1. The overall hierarchical architecture of Next-ViT.

Introduction

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far from satisfactory. To this end, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency.

Figure 2. Comparison among Next-ViT and efficient Networks, in terms of accuracy-latency trade-off.

Usage

First, clone the repository locally:

git clone https://github.com/bytedance/Next-ViT.git

Then, install torch==1.10.0, mmcv-full==1.5.0, timm==0.4.9 and the other dependencies:

pip3 install -r requirements.txt

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout for torchvision's datasets.ImageFolder, and the training and validation data are expected to be in the train/ and val/ folders respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
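
Before launching a long training run, it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch using only the Python standard library; the helper names `list_classes` and `check_imagefolder_layout` are hypothetical and are not part of this repository.

```python
import os


def list_classes(split_dir):
    """Return the sorted class-folder names found directly under split_dir."""
    return sorted(
        entry
        for entry in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, entry))
    )


def check_imagefolder_layout(root):
    """Check that root/train and root/val hold the same, non-empty set of class folders.

    Raises FileNotFoundError if a split directory is missing and ValueError
    if the class folders are empty or differ between the two splits.
    """
    train_classes = list_classes(os.path.join(root, "train"))
    val_classes = list_classes(os.path.join(root, "val"))
    if not train_classes:
        raise ValueError("no class folders found under train/")
    if train_classes != val_classes:
        mismatched = sorted(set(train_classes) ^ set(val_classes))
        raise ValueError(f"class folders differ between train/ and val/: {mismatched}")
    return train_classes
```

Calling `check_imagefolder_layout("/path/to/imagenet")` returns the list of class names when the layout is consistent. It only inspects folder names and does not validate the image files themselves.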

Image Classification

We provide a series of Next-ViT models pretrained on the ILSVRC2012 ImageNet-1K dataset. More details can be found in the [paper].

Model | Dataset | Resolution | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | Acc@1 (%) | ckpt | log
Next-ViT-S | ImageNet-1K | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 82.5 | ckpt | log
Next-ViT-B | ImageNet-1K | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 83.2 | ckpt | log
Next-ViT-L | ImageNet-1K | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 83.6 | ckpt | log
Next-ViT-S | ImageNet-1K | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 83.6 | ckpt | log
Next-ViT-B | ImageNet-1K | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 84.3 | ckpt | log
Next-ViT-L | ImageNet-1K | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 84.7 | ckpt | log

We also provide a series of Next-ViT models pretrained on a large-scale dataset, following [SSLD]. More details can be found in the [paper].

Model | Dataset | Resolution | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | Acc@1 (%) | ckpt
Next-ViT-S | ImageNet-1K-6M | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 84.8 | ckpt
Next-ViT-B | ImageNet-1K-6M | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 85.1 | ckpt
Next-ViT-L | ImageNet-1K-6M | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 85.4 | ckpt
Next-ViT-S | ImageNet-1K-6M | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 85.8 | ckpt
Next-ViT-B | ImageNet-1K-6M | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 86.1 | ckpt
Next-ViT-L | ImageNet-1K-6M | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 86.4 | ckpt
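
As a reading aid for the two tables above, the short sketch below computes the Acc@1 gain from large-scale pretraining at 224x224 resolution. The accuracy values are copied from the tables; the dictionary and function names are illustrative only.

```python
# Acc@1 (%) at 224x224, copied from the ImageNet-1K and ImageNet-1K-6M tables.
in1k = {"Next-ViT-S": 82.5, "Next-ViT-B": 83.2, "Next-ViT-L": 83.6}
in1k_6m = {"Next-ViT-S": 84.8, "Next-ViT-B": 85.1, "Next-ViT-L": 85.4}


def accuracy_gain(baseline, improved):
    """Return the per-model accuracy difference, rounded to one decimal place."""
    return {name: round(improved[name] - baseline[name], 1) for name in baseline}


gains = accuracy_gain(in1k, in1k_6m)
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} Acc@1")
```

With these numbers the gains are 2.3, 1.9 and 1.8 points for Next-ViT-S, -B and -L respectively, while FLOPs, parameter counts and latency are unchanged because the architectures are the same.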

Training

To train Next-ViT-S on ImageNet using 8 GPUs for 300 epochs, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path

To fine-tune Next-ViT-S with a 384x384 input size for 30 epochs, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 128 --lr 5e-6 --warmup-epochs 0 --weight-decay 1e-8 --epochs 30 --sched step --decay-epochs 60 --input-size 384 --resume ../checkpoints/nextvit_small_in1k_224.pth --finetune --data-path your_imagenet_path 

Evaluation

To evaluate the performance of Next-ViT-S on ImageNet using 8 GPUs, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path --resume ../checkpoints/nextvit_small_in1k_224.pth --eval
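
The Acc@1 column in the tables above is top-1 accuracy: the percentage of validation images whose highest-scoring class matches the ground-truth label. The sketch below illustrates the metric in plain Python; it is not the repository's evaluation code, and `top1_accuracy` is a hypothetical helper name.

```python
def top1_accuracy(logits, labels):
    """Return the percentage of samples whose arg-max class equals the label.

    logits: list of per-sample score lists, one score per class.
    labels: list of integer class indices, one per sample.
    """
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and of equal length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example: four samples, three classes; the last prediction is wrong.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.7, 0.6, 0.1]]
labels = [1, 0, 2, 1]
print(top1_accuracy(logits, labels))  # 75.0
```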

Detection

Our code is based on mmdetection; please install mmdetection==2.23.0. Next-ViT serves as a strong backbone for Mask R-CNN. It is easy to apply Next-ViT to other detectors provided by mmdetection based on our examples. More details can be found in the [paper].

Mask R-CNN

Backbone | Pretrained | Lr Schd | Params (M) | FLOPs (G) | TensorRT Latency (ms) | CoreML Latency (ms) | bbox mAP | mask mAP | ckpt | log
Next-ViT-S | ImageNet-1K | 1x | 51.8 | 290 | 38.2 | 18.1 | 45.9 | 41.8 | ckpt | log
Next-ViT-S | ImageNet-1K | 3x | 51.8 | 290 | 38.2 | 18.1 | 48.0 | 43.2 | ckpt | log
Next-ViT-B | ImageNet-1K | 1x | 64.9 | 340 | 51.6 | 24.4 | 47.2 | 42.8 | ckpt | log
Next-ViT-B | ImageNet-1K | 3x | 64.9 | 340 | 51.6 | 24.4 | 49.5 | 44.4 | ckpt | log
Next-ViT-L | ImageNet-1K | 1x | 77.9 | 391 | 65.3 | 30.1 | 48.0 | 43.2 | ckpt | log
Next-ViT-L | ImageNet-1K | 3x | 77.9 | 391 | 65.3 | 30.1 | 50.2 | 44.8 | ckpt | log

Training

To train Mask R-CNN with a Next-ViT-S backbone using 8 GPUs, run:

cd detection/
PORT=29501 bash dist_train.sh configs/mask_rcnn_nextvit_small_1x.py 8

Evaluation

To evaluate Mask R-CNN with a Next-ViT-S backbone using 8 GPUs, run:

cd detection/
PORT=29501 bash dist_test.sh configs/mask_rcnn_nextvit_small_1x.py ../checkpoints/mask_rcnn_1x_nextvit_small.pth 8 --eval bbox

Semantic Segmentation

Our code is based on mmsegmentation; please install mmsegmentation==0.23.0. Next-ViT serves as a strong backbone for segmentation tasks on the ADE20K dataset. It is easy to extend it to other datasets and segmentation methods. More details can be found in the [paper].

Semantic FPN 80k

Backbone | Pretrained | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | mIoU | ckpt | log
Next-ViT-S | ImageNet-1K | 208 | 36.3 | 38.2 | 18.1 | 46.5 | ckpt | log
Next-ViT-B | ImageNet-1K | 260 | 49.3 | 51.6 | 24.4 | 48.6 | ckpt | log
Next-ViT-L | ImageNet-1K | 331 | 62.4 | 65.3 | 30.1 | 49.1 | ckpt | log
Next-ViT-S | ImageNet-1K-6M | 208 | 36.3 | 38.2 | 18.1 | 48.8 | ckpt | log
Next-ViT-B | ImageNet-1K-6M | 260 | 49.3 | 51.6 | 24.4 | 50.2 | ckpt | log
Next-ViT-L | ImageNet-1K-6M | 331 | 62.4 | 65.3 | 30.1 | 50.5 | ckpt | log

UperNet 160k

Backbone | Pretrained | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | mIoU (ss/ms) | ckpt | log
Next-ViT-S | ImageNet-1K | 968 | 66.3 | 38.2 | 18.1 | 48.1/49.0 | ckpt | log
Next-ViT-B | ImageNet-1K | 1020 | 79.3 | 51.6 | 24.4 | 50.4/51.1 | ckpt | log
Next-ViT-L | ImageNet-1K | 1072 | 92.4 | 65.3 | 30.1 | 50.1/50.8 | ckpt | log
Next-ViT-S | ImageNet-1K-6M | 968 | 66.3 | 38.2 | 18.1 | 49.8/50.8 | ckpt | log
Next-ViT-B | ImageNet-1K-6M | 1020 | 79.3 | 51.6 | 24.4 | 51.8/52.8 | ckpt | log
Next-ViT-L | ImageNet-1K-6M | 1072 | 92.4 | 65.3 | 30.1 | 51.5/52.0 | ckpt | log

Training

To train Semantic FPN 80k with a Next-ViT-S backbone using 8 GPUs, run:

cd segmentation/
PORT=29501 bash dist_train.sh configs/fpn_512_nextvit_small_80k.py 8

Evaluation

To evaluate Semantic FPN 80k (single scale) with a Next-ViT-S backbone using 8 GPUs, run:

cd segmentation/
PORT=29501 bash dist_test.sh configs/fpn_512_nextvit_small_80k.py ../checkpoints/fpn_80k_nextvit_small.pth 8 --eval mIoU

Deployment and Latency Measurement

We provide scripts to convert Next-ViT from a PyTorch model to a CoreML model and a TensorRT engine.

CoreML

To convert Next-ViT-S to a CoreML model with coremltools==5.2.0, run:

cd deployment/
python3 export_coreml_model.py --model nextvit_small --batch-size 1 --image-size 224

Backbone | Resolution | FLOPs (G) | CoreML Latency (ms) | CoreML Model
Next-ViT-S | 224 | 5.8 | 3.5 | mlmodel
Next-ViT-B | 224 | 8.3 | 4.5 | mlmodel
Next-ViT-L | 224 | 10.8 | 5.5 | mlmodel

We uniformly benchmark CoreML latency on an iPhone 12 Pro Max (iOS 16.0) with Xcode 14.0. The performance report of a CoreML model can be generated directly with Xcode 14.0 (a new feature of Xcode 14.0).

Figure 3. CoreML latency of Next-ViT-S/B/L.

TensorRT

To convert Next-ViT-S to a TensorRT engine with tensorrt==8.0.3.4, run:

cd deployment/
python3 export_tensorrt_engine.py --model nextvit_small --batch-size 8  --image-size 224 --datatype fp16 --profile True --trtexec-path /usr/bin/trtexec
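
The command above takes a --trtexec-path and a --profile flag, so TensorRT profiling is handled by the export script and trtexec. For a rough sanity check of any Python callable (for example a model's forward pass), latency measurement usually follows a warm-up-then-time pattern. The sketch below shows that pattern with the standard library only; it is not the repository's measurement code, `measure_latency_ms` and `dummy_workload` are hypothetical names, and timing GPU work this way would additionally require synchronizing the device before reading the clock.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    # Warm-up iterations are discarded so that one-off costs
    # (caches, lazy initialization) do not skew the statistics.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "min_ms": min(timings),
    }


def dummy_workload():
    """Stand-in for a model forward pass (assumption for illustration only)."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(measure_latency_ms(dummy_workload))
```

Reporting the median alongside the mean makes the result less sensitive to occasional slow iterations caused by other processes on the machine.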

Citation

If you find this project useful in your research, please consider citing:

@article{li2022next,
  title={Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios},
  author={Li, Jiashi and Xia, Xin and Li, Wei and Li, Huixia and Wang, Xing and Xiao, Xuefeng and Wang, Rui and Zheng, Min and Pan, Xin},
  journal={arXiv preprint arXiv:2207.05501},
  year={2022}
}

Acknowledgement

We heavily borrow the code from Twins.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
