An Extendable, Efficient and Effective Transformer-based Object Detector (Extension of ViDT published at ICLR 2022)

Please see the vidt branch if you are interested in the vanilla ViDT model.
This is an extension of ViDT for joint-learning of object detection and instance segmentation.

by Hwanjun Song1, Deqing Sun2, Sanghyuk Chun1, Varun Jampani2, Dongyoon Han1,
Byeongho Heo1, Wonjae Kim1, and Ming-Hsuan Yang2,3

1 NAVER AI Lab, 2 Google Research, 3 University of California, Merced

  • April 6, 2022: The official code is released!
    We obtained a light-weight transformer-based detector, achieving 47.0 AP with only 14M parameters and 41.9 FPS (NVIDIA A100).
    See Complete Analysis.
  • April 19, 2022: The preprint is uploaded. See [here]!

ViDT+ for Joint-learning of Object Detection and Instance Segmentation

Extension to ViDT+

We extend ViDT into ViDT+, which supports joint learning of object detection and instance segmentation in an end-to-end manner. Three new components are introduced for the extension: (1) an efficient pyramid feature fusion (EPFF) module, (2) a unified query representation (UQR) module, and (3) two auxiliary losses, IoU-aware and token labeling. Compared with the vanilla ViDT, ViDT+ provides a significant performance improvement without compromising inference speed. Only 1M parameters are added to the model.

Evaluation

Index: [A. ViT Backbone], [B. Main Results], [C. Complete Analysis]

|--- A. ViT Backbone used for ViDT+
|--- B. Main Results in the ViDT+ Paper
     |--- B.1. ViDT+ compared with the vanilla ViDT for Object Detection
     |--- B.2. ViDT+ compared with CNN-based methods for Object Detection and Instance Segmentation
|--- C. Complete Component Analysis

A. ViT Backbone used for ViDT+

| Backbone and Size | Training Data | Epochs | Resolution | Params | ImageNet Acc. | Checkpoint |
|---|---|---|---|---|---|---|
| Swin-nano | ImageNet-1K | 300 | 224 | 6M | 74.9% | Github |
| Swin-tiny | ImageNet-1K | 300 | 224 | 28M | 81.2% | Github |
| Swin-small | ImageNet-1K | 300 | 224 | 50M | 83.2% | Github |
| Swin-base | ImageNet-22K | 90 | 224 | 88M | 86.3% | Github |

B. Main Results in the ViDT+ Paper

All models were re-trained with the final version of the source code, so the values may differ very slightly from those in the paper. Note that a single NVIDIA A100 GPU was used to compute FPS with an input batch size of 1.
Compared with the vanilla version, ViDT+ leverages three additional components or techniques:
(1) An efficient pyramid feature fusion (EPFF) module.
(2) A unified query representation (UQR) module.
(3) Two additional losses: an IoU-aware loss and a token-labeling loss.
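
As a rough illustration of how a batch-size-1 FPS figure can be measured, the sketch below times repeated calls to a callable after a few warm-up iterations. The name `measure_fps`, the warm-up and iteration counts, and the dummy workload are assumptions made for illustration and are not taken from this codebase. For a real GPU model you would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time


def measure_fps(run_once, warmup=5, iters=50):
    """Return calls per second of `run_once`, ignoring warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up: caches, lazy initialization, etc.
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_inference():
    # Hypothetical stand-in for one forward pass on a single image.
    return sum(i * i for i in range(10_000))


fps = measure_fps(dummy_inference)
print(f"{fps:.1f} FPS, {1000.0 / fps:.2f} ms per image")
```

The reciprocal gives per-image latency; for instance, 41.9 FPS corresponds to roughly 23.9 ms per image.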

B.1. ViDT+ compared with the vanilla ViDT for Object Detection
| Method | Backbone | Epochs | AP | AP50 | AP75 | AP_S | AP_M | AP_L | Params | FPS | Checkpoint / Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViDT+ | Swin-nano | 50 | 45.3 | 62.3 | 48.9 | 27.3 | 48.2 | 61.5 | 16M | 37.6 | Github / Log |
| ViDT+ | Swin-tiny | 50 | 49.7 | 67.7 | 54.2 | 31.6 | 53.4 | 65.9 | 38M | 30.4 | Github / Log |
| ViDT+ | Swin-small | 50 | 51.2 | 69.5 | 55.9 | 33.8 | 54.5 | 67.8 | 61M | 20.6 | Github / Log |
| ViDT+ | Swin-base | 50 | 53.2 | 71.6 | 58.3 | 36.0 | 57.1 | 69.2 | 100M | 19.3 | Github / Log |

| Method | Backbone | Epochs | AP | AP50 | AP75 | AP_S | AP_M | AP_L | Params | FPS | Checkpoint / Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViDT | Swin-nano | 50 | 40.4 | 59.9 | 43.0 | 23.1 | 42.8 | 55.9 | 15M | 40.8 | Github / Log |
| ViDT | Swin-tiny | 50 | 44.9 | 64.7 | 48.3 | 27.5 | 47.9 | 61.9 | 37M | 33.5 | Github / Log |
| ViDT | Swin-small | 50 | 47.4 | 67.7 | 51.2 | 30.4 | 50.7 | 64.6 | 60M | 24.7 | Github / Log |
| ViDT | Swin-base | 50 | 49.4 | 69.6 | 53.4 | 31.6 | 52.4 | 66.8 | 99M | 20.5 | Github / Log |
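
To make the trade-off in the two tables above easier to read, the following small sketch computes the AP gain and the FPS change of ViDT+ over ViDT for each backbone. The numbers are copied from the tables; the variable names are only illustrative.

```python
# (AP, FPS) pairs copied from the ViDT+ and ViDT tables above (50 epochs).
vidt_plus = {"Swin-nano": (45.3, 37.6), "Swin-tiny": (49.7, 30.4),
             "Swin-small": (51.2, 20.6), "Swin-base": (53.2, 19.3)}
vidt = {"Swin-nano": (40.4, 40.8), "Swin-tiny": (44.9, 33.5),
        "Swin-small": (47.4, 24.7), "Swin-base": (49.4, 20.5)}

gains = {}
for backbone, (ap_plus, fps_plus) in vidt_plus.items():
    ap_base, fps_base = vidt[backbone]
    ap_gain = round(ap_plus - ap_base, 1)        # AP difference, ViDT+ minus ViDT
    fps_change = round(fps_plus - fps_base, 1)   # negative means ViDT+ is slower
    gains[backbone] = (ap_gain, fps_change)
    print(f"{backbone}: AP {ap_gain:+.1f}, FPS {fps_change:+.1f}")
```

With these table values, the AP gain ranges from 3.8 to 4.9 points, at a cost of 1.2 to 4.1 FPS.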
B.2. ViDT+ compared with CNN-based methods for Object Detection and Instance Segmentation

For a fair comparison w.r.t. the number of parameters, Swin-tiny and Swin-small backbones are used for ViDT+, which have a similar number of parameters to ResNet-50 and ResNet-101, respectively.
ViDT+ shows much higher detection AP than the other joint-learning methods, but its segmentation AP is generally higher than the others only for medium- and large-size objects.

| Method | Backbone | Epochs | Box AP | Mask AP | Mask AP_S | Mask AP_M | Mask AP_L |
|---|---|---|---|---|---|---|---|
| Mask R-CNN | ResNet-50 + FPN | 36 | 41.3 | 37.5 | 21.1 | 39.6 | 48.3 |
| HTC | ResNet-50 + FPN | 36 | 44.9 | 39.7 | 22.6 | 42.2 | 50.6 |
| SOLOv2 | ResNet-50 + FPN | 72 | 40.4 | 38.8 | 16.5 | 41.7 | 56.2 |
| QueryInst | ResNet-50 + FPN | 36 | 45.6 | 40.6 | 23.4 | 42.5 | 52.8 |
| SOLQ | ResNet-50 | 50 | 47.8 | 39.7 | 21.5 | 42.5 | 53.1 |
| ViDT+ | Swin-tiny | 50 | 49.7 | 39.5 | 21.5 | 43.4 | 58.2 |

| Method | Backbone | Epochs | Box AP | Mask AP | Mask AP_S | Mask AP_M | Mask AP_L |
|---|---|---|---|---|---|---|---|
| Mask R-CNN | ResNet-101 + FPN | 50 | 41.3 | 38.8 | 21.8 | 41.4 | 50.5 |
| HTC | ResNet-101 + FPN | 50 | 44.3 | 40.8 | 23.0 | 43.5 | 58.2 |
| SOLOv2 | ResNet-101 + FPN | 50 | 42.6 | 39.7 | 17.3 | 42.9 | 58.2 |
| QueryInst | ResNet-101 + FPN | 50 | 48.1 | 42.8 | 24.6 | 45.0 | 58.2 |
| SOLQ | ResNet-101 | 50 | 48.7 | 40.9 | 22.5 | 43.8 | 58.2 |
| ViDT+ | Swin-small | 50 | 51.2 | 40.8 | 22.6 | 44.3 | 60.1 |
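
The remark about medium- and large-size objects can be checked directly against the first table above (ResNet-50 group and ViDT+ with Swin-tiny). The sketch below picks the best method per metric; the values are copied from that table, and the variable names are only illustrative.

```python
# Rows copied from the first table above:
# (Box AP, Mask AP, Mask AP_S, Mask AP_M, Mask AP_L).
METRICS = ("Box AP", "Mask AP", "Mask AP_S", "Mask AP_M", "Mask AP_L")
rows = {
    "Mask R-CNN": (41.3, 37.5, 21.1, 39.6, 48.3),
    "HTC": (44.9, 39.7, 22.6, 42.2, 50.6),
    "SOLOv2": (40.4, 38.8, 16.5, 41.7, 56.2),
    "QueryInst": (45.6, 40.6, 23.4, 42.5, 52.8),
    "SOLQ": (47.8, 39.7, 21.5, 42.5, 53.1),
    "ViDT+": (49.7, 39.5, 21.5, 43.4, 58.2),
}

best = {}
for i, metric in enumerate(METRICS):
    # Method with the highest value in column i.
    best[metric] = max(rows, key=lambda name: rows[name][i])
    print(f"{metric}: {best[metric]} ({rows[best[metric]][i]})")
```

On these numbers ViDT+ leads in Box AP, Mask AP_M, and Mask AP_L, while QueryInst leads in overall Mask AP and Mask AP_S.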

C. Complete Component Analysis

We combined the four proposed components (even with distillation with token matching and decoding layer drop) to achieve high accuracy and speed for object detection. For distillation, ViDT (Swin-base) trained for 50 epochs was used for all models.

We combined all the proposed components (even with longer training epochs and decoding layer dropping) to achieve high accuracy and speed for object detection. As summarized in the table below, there are eight components for extension: (1) RAM, (2) the neck decoder, (3) the IoU-aware and token labeling losses, (4) the EPFF module, (5) the UQR module, (6) the use of more detection tokens, (7) the use of longer training epochs, and (8) decoding layer drop.

Rows (2), (6), and (8) correspond to the vanilla ViDT, its extension to ViDT+, and the fully optimized ViDT+, respectively.

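| # | Added Module | Swin-nano AP | Swin-nano Params | Swin-nano FPS | Swin-tiny AP | Swin-tiny Params | Swin-tiny FPS | Swin-small AP | Swin-small Params | Swin-small FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| (1) | + RAM | 28.7 | 7M | 72.4 | 36.3 | 29M | 51.8 | 41.6 | 52M | 33.5 |
| (2) | + Encoder-free Neck | 40.4 | 15M | 40.8 | 44.8 | 37M | 33.5 | 47.5 | 60M | 24.7 |
| (3) | + IoU-aware & Token Label | 41.0 | 15M | 40.8 | 45.9 | 37M | 33.5 | 48.5 | 60M | 24.7 |
| (4) | + EPFF Module | 42.5 | 16M | 38.0 | 47.1 | 38M | 30.9 | 49.3 | 61M | 23.0 |
| (5) | + UQR Module | 43.9 | 16M | 38.0 | 47.9 | 38M | 30.9 | 50.1 | 61M | 23.0 |
| (6) | + 300 [DET] Tokens | 45.3 | 16M | 37.6 | 49.7 | 38M | 30.4 | 51.2 | 61M | 22.6 |
| (7) | + 150 Training Epochs | 47.6 | 16M | 37.6 | 51.4 | 38M | 30.4 | 52.3 | 61M | 22.6 |
| (8) | + Decoding Layer Drop | 47.0 | 14M | 41.9 | 50.8 | 36M | 33.9 | 51.8 | 59M | 24.6 |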
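
Since each row of the table above adds one component on top of the previous row, the per-component contribution is the difference between consecutive rows. The sketch below computes these differences for the Swin-nano AP column; the values are copied from the table, and the variable names are only illustrative.

```python
# Swin-nano AP per row of the table above, in order (1) to (8).
steps = [
    ("RAM", 28.7),
    ("Encoder-free Neck", 40.4),
    ("IoU-aware & Token Label", 41.0),
    ("EPFF Module", 42.5),
    ("UQR Module", 43.9),
    ("300 [DET] Tokens", 45.3),
    ("150 Training Epochs", 47.6),
    ("Decoding Layer Drop", 47.0),
]

# AP change contributed by each row relative to the row before it.
deltas = [(name, round(ap - prev_ap, 1))
          for (_, prev_ap), (name, ap) in zip(steps, steps[1:])]
for name, delta in deltas:
    print(f"+ {name}: {delta:+.1f} AP")
```

The last step is the only negative one: in the table, decoding layer drop gives up 0.6 AP on Swin-nano in exchange for fewer parameters (16M to 14M) and higher FPS (37.6 to 41.9).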

The optimized ViDT+ models can be found:
ViDT+ (Swin-nano), ViDT+ (Swin-tiny), and ViDT+ (Swin-small).

Requirements

This codebase has been developed with the setting used in Deformable DETR:
Linux, CUDA>=9.2, GCC>=5.4, Python>=3.7, PyTorch>=1.5.1, and torchvision>=0.6.1.

We recommend using Anaconda to create a conda environment:

conda create -n deformable_detr python=3.7 pip
conda activate deformable_detr
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch

Compiling CUDA operators for deformable attention

cd ./ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py

Other requirements

pip install -r requirements.txt
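
Before compiling the CUDA operators, it can help to confirm that the active environment meets the minimum versions listed above. The following is a minimal sketch of such a check; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and not part of this repository, and the version parsing is deliberately simple (it keeps only the first three numeric components).

```python
import re
import sys

# Minimum versions quoted from the requirements above.
MINIMUMS = {"python": "3.7", "torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text):
    """Turn '1.13.0+cu117' into (1, 13, 0); local build suffixes are ignored."""
    parts = [int(p) for p in re.findall(r"\d+", text.split("+")[0])]
    return tuple((parts + [0, 0, 0])[:3])  # pad short versions such as '3.7'


def meets_minimum(installed, minimum):
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Map each requirement to (installed version or None, requirement met)."""
    found = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    for name in ("torch", "torchvision"):
        try:
            found[name] = __import__(name).__version__
        except ImportError:
            found[name] = None  # package is not installed
    return {name: (ver, ver is not None and meets_minimum(ver, MINIMUMS[name]))
            for name, ver in found.items()}


for name, (version, ok) in check_environment().items():
    print(f"{name}: {version} ({'ok' if ok else 'missing or too old'})")
```

This only checks the Python-side packages; the CUDA and GCC requirements still need to be verified separately.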

Training and Evaluation

If you want to test with a single GPU, see the Colab examples. Thanks to EherSenaw for making this example.
The commands below are for training with multiple GPUs.

Training for ViDT+

We used the commands below to train ViDT+ models on a single node with 8 NVIDIA GPUs.

Run this command to train the ViDT+ (Swin-nano) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT+ (Swin-tiny) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT+ (Swin-small) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT+ (Swin-base) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
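
The four training commands above differ only in `--backbone_name`, so they can be generated from one template. The sketch below builds the argument list in Python. The function `build_train_command` and its parameters are hypothetical helpers written for illustration, while every flag and value is copied from the commands above.

```python
import shlex

# Flags shared by all four ViDT+ training commands above.
COMMON_FLAGS = {
    "--epochs": "50", "--lr": "1e-4", "--min-lr": "1e-7",
    "--batch_size": "2", "--num_workers": "2", "--aux_loss": "True",
    "--with_box_refine": "True", "--det_token_num": "300", "--epff": "True",
    "--token_label": "True", "--iou_aware": "True", "--with_vector": "True",
    "--masks": "True",
}
BACKBONES = ("swin_nano", "swin_tiny", "swin_small", "swin_base_win7_22k")


def build_train_command(backbone, coco_path, output_dir, gpus=8):
    """Return the launch command as a list of arguments (nothing is executed)."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone}")
    cmd = ["python", "-m", "torch.distributed.launch",
           f"--nproc_per_node={gpus}", "--nnodes=1", "--use_env", "main.py",
           "--method", "vidt", "--backbone_name", backbone]
    for flag, value in COMMON_FLAGS.items():
        cmd += [flag, value]
    cmd += ["--coco_path", coco_path, "--output_dir", output_dir]
    return cmd


cmd = build_train_command("swin_tiny", "/path/to/coco", "/path/for/output")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root; `shlex.join` requires Python 3.8 or later.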

Evaluation for ViDT+

Run this command to evaluate the ViDT+ (Swin-nano) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_nano \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT+ (Swin-tiny) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_tiny \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT+ (Swin-small) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_small \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT+ (Swin-base) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_base \
       --pre_trained none \
       --eval True

Training for ViDT

We used the commands below to train ViDT models on a single node with 8 NVIDIA GPUs.

Run this command to train the ViDT (Swin-nano) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT (Swin-tiny) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT (Swin-small) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
Run this command to train the ViDT (Swin-base) model in the paper:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
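
To see at a glance how the ViDT and ViDT+ training commands in this README differ, the sketch below compares the two Swin-nano flag sets. All flags and values are copied from the commands in this README; the variable names are only illustrative.

```python
# Flags shared by the ViDT and ViDT+ Swin-nano training commands.
shared = {
    "--method": "vidt", "--backbone_name": "swin_nano", "--epochs": "50",
    "--lr": "1e-4", "--min-lr": "1e-7", "--batch_size": "2",
    "--num_workers": "2", "--aux_loss": "True", "--with_box_refine": "True",
}
vidt_flags = {**shared, "--det_token_num": "100"}
vidt_plus_flags = {
    **shared, "--det_token_num": "300", "--epff": "True",
    "--token_label": "True", "--iou_aware": "True",
    "--with_vector": "True", "--masks": "True",
}

# Flags that appear only in the ViDT+ command.
only_in_plus = sorted(set(vidt_plus_flags) - set(vidt_flags))
# Flags present in both commands but with different values: (ViDT, ViDT+).
changed = {flag: (vidt_flags[flag], vidt_plus_flags[flag])
           for flag in vidt_flags
           if vidt_flags[flag] != vidt_plus_flags[flag]}

print("only in ViDT+:", only_in_plus)
print("changed:", changed)
```

In other words, ViDT+ training turns on five extra options and raises the number of detection tokens from 100 to 300; everything else in the commands is the same.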

Evaluation for ViDT

Run this command to evaluate the ViDT (Swin-nano) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_nano \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT (Swin-tiny) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_tiny \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT (Swin-small) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_small \
       --pre_trained none \
       --eval True
Run this command to evaluate the ViDT (Swin-base) model on COCO:

python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_base \
       --pre_trained none \
       --eval True

Citation

Please consider citing our papers if they are useful in your research.

@inproceedings{song2022vidt,
  title={ViDT: An Efficient and Effective Fully Transformer-based Object Detector},
  author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
  booktitle={International Conference on Learning Representations},
  year={2022}
}
@article{song2022vidtplus,
  title={An Extendable, Efficient and Effective Transformer-based Object Detector},
  author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2204.07962},
  year={2022}
}

License

Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
