  • Stars: 152
  • Rank: 243,260 (Top 5%)
  • Language: Python
  • Created: over 2 years ago
  • Updated: over 1 year ago

Repository Details

The official repo for [ECCV'22] "VSA: Learning Varied-Size Window Attention in Vision Transformers"

Updates | Introduction | Statement

Current applications

Classification: Please see ViTAE-VSA for Image Classification for usage details;

Object Detection: Please see ViTAE-VSA for Object Detection for usage details;

Semantic Segmentation: Will be released in the next few days;

Other ViTAE applications

ViTAE & ViTAEv2: Please see ViTAE-Transformer for Image Classification, Object Detection, and Semantic Segmentation;

Matting: Please see ViTAE-Transformer for matting;

Remote Sensing: Please see ViTAE-Transformer for Remote Sensing;

Updates

19/09/2022

  • The code and training logs for ViTAE-VSA have been released! The semantic segmentation code and Swin+VSA will be released in the next few days.

09/07/2022

  • The paper has been accepted by ECCV'22!

19/04/2022

  • The paper is posted on arXiv! The code will be made publicly available once it is cleaned up.

Introduction

This repository contains the code, models, and test results for the paper VSA: Learning Varied-Size Window Attention in Vision Transformers. We design a novel varied-size window attention module that learns adaptive window configurations from data. By adopting VSA in each head independently, the model can capture long-range dependencies and rich contextual information from diverse windows. VSA can replace the window attention in state-of-the-art methods and facilitate learning on various vision tasks, including classification, detection, and segmentation.

Fig.1 - The comparison of the current design (hand-crafted windows) and VSA.

Fig.2 - The architecture of VSA.

Usage

If you are interested in using only the VSA attention module, please refer to this file in the classification code or the VSAWindowAttention class in the object detection code.
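
For readers who just want to see the mechanism, below is a minimal, self-contained PyTorch sketch of varied-size window attention. It is an illustrative approximation under simplifying assumptions, not the repository's VSAWindowAttention implementation: the class name ToyVariedSizeWindowAttention is made up, the window scale and offset are predicted once per window rather than per head, and the feature map is assumed to be divisible by the window size.

```python
# Illustrative sketch only (NOT the repository's VSAWindowAttention):
# per-window scale/offset are predicted from pooled window features,
# keys/values are resampled on the transformed windows, then standard
# window attention is applied.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVariedSizeWindowAttention(nn.Module):
    def __init__(self, dim=96, window_size=7, num_heads=4):
        super().__init__()
        self.ws, self.nh = window_size, num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, dim * 2, 1)
        # Pool each default window, then predict (scale_x, scale_y, offset_x, offset_y).
        self.pred = nn.Sequential(
            nn.AvgPool2d(window_size, stride=window_size),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(dim, 4, 1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window_size
        B, C, H, W = x.shape
        ws, nh = self.ws, self.nh
        nwh, nww = H // ws, W // ws  # number of windows along each axis
        q, kv = self.q(x), self.kv(x)

        # Base sampling grid in [-1, 1], grouped into the default windows.
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), -1)  # (H, W, 2) in (x, y) order
        base = base.view(nwh, ws, nww, ws, 2).permute(0, 2, 1, 3, 4)  # (nwh, nww, ws, ws, 2)
        centre = base.mean(dim=(2, 3), keepdim=True)

        # Predicted window transform; scale starts near 1 and offset near 0.
        t = self.pred(x).permute(0, 2, 3, 1)  # (B, nwh, nww, 4)
        scl = 1.0 + t[..., :2].reshape(B, nwh, nww, 1, 1, 2)
        off = t[..., 2:].reshape(B, nwh, nww, 1, 1, 2)
        grid = (base - centre) * scl + centre + off  # varied-size windows
        grid = grid.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 2)

        # Resample keys/values inside the transformed windows.
        kv = F.grid_sample(kv, grid, align_corners=True)
        k, v = kv.chunk(2, dim=1)

        def to_windows(feat):  # (B, C', H, W) -> (B * num_windows, heads, ws*ws, head_dim)
            c = feat.shape[1]
            feat = feat.reshape(B, c, nwh, ws, nww, ws).permute(0, 2, 4, 1, 3, 5)
            return feat.reshape(B * nwh * nww, nh, c // nh, ws * ws).transpose(-2, -1)

        q, k, v = to_windows(q), to_windows(k), to_windows(v)
        attn = ((q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v  # (B * num_windows, heads, ws*ws, head_dim)
        out = out.transpose(-2, -1).reshape(B, nwh, nww, C, ws, ws)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)
        return self.proj(out)


if __name__ == "__main__":
    attn = ToyVariedSizeWindowAttention(dim=96, window_size=7, num_heads=4)
    print(attn(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

The sketch mirrors the general idea described in the paper: queries come from the default windows, while keys and values are re-sampled from learned, scaled and shifted windows before standard window attention is applied. Refer to the linked files for the actual implementation.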

Classification Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Main Results on ImageNet-1K with pretrained models

| Name | Resolution | Acc@1 | Acc@5 | Acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| Swin-T | 224x224 | 81.2 | \ | \ | \ |
| Swin-T+VSA | 224x224 | 82.24 | 95.8 | \ | Coming Soon |
| ViTAEv2*-S | 224x224 | 82.2 | 96.1 | 87.5 | \ |
| ViTAEv2-S | 224x224 | 82.6 | 96.2 | 87.6 | weights&logs |
| ViTAEv2*-S+VSA | 224x224 | 82.7 | 96.3 | 87.7 | weights&logs |
| Swin-S | 224x224 | 83.0 | \ | \ | \ |
| Swin-S+VSA | 224x224 | 83.6 | 96.6 | \ | Coming Soon |
| ViTAEv2*-48M+VSA | 224x224 | 83.9 | 96.6 | \ | weights&logs |

Models with ImageNet-22K pretraining

| Name | Resolution | Acc@1 | Acc@5 | Acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| ViTAEv2*-48M+VSA | 224x224 | 84.9 | 97.4 | \ | Coming Soon |
| ViTAEv2*-B+VSA | 224x224 | 86.2 | 97.9 | 90.0 | Coming Soon |

Object Detection Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTAEv2*-S | ImageNet-1K | 1x | 43.5 | 39.4 | 37M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 46.3 | 41.8 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 45.9 | 41.4 | 37M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 44.7 | 40.0 | 39M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 47.8 | 42.6 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 48.1 | 42.9 | 39M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 49.9 | 44.2 | 69M | config | github | Coming Soon |

Cascade Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTAEv2*-S | ImageNet-1K | 1x | 47.3 | 40.6 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 50.6 | 43.6 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 49.8 | 43.0 | 77M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 48.0 | 41.3 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 51.4 | 44.5 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 51.9 | 44.8 | 77M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 52.9 | 45.6 | 108M | config | github | Coming Soon |

Semantic Segmentation Results for Cityscapes

ViTAEv2* denotes the variant that uses window attention in all stages.

UperNet

512x1024 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | mIoU* | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 40k | 78.9 | 79.9 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 40k | 80.8 | 81.7 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 40k | 80.1 | 80.9 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 40k | 81.4 | 82.3 | \ | \ | \ | \ |
| Swin-T | ImageNet-1K | 80k | 79.3 | 80.2 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 80k | 81.6 | 82.4 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 80k | 80.8 | 81.0 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 80k | 82.2 | 83.0 | \ | \ | \ | \ |

769x769 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | ms mIoU | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 40k | 79.3 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 40k | 81.0 | 81.9 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 40k | 79.6 | 80.6 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 40k | 81.7 | 82.5 | \ | \ | \ | \ |
| Swin-T | ImageNet-1K | 80k | 79.6 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 80k | 81.6 | 82.5 | \ | \ | \ | \ |

Please refer to our paper for more experimental results.

Statement

This project is for research purposes only. For any other questions, please contact qmzhangzz at hotmail.com or yufei.xu at outlook.com.

The codebase borrows from T2T, ViTAEv2, and Swin.

Citing VSA and ViTAE

@article{zhang2022vsa,
  title={VSA: Learning Varied-Size Window Attention in Vision Transformers},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2204.08446},
  year={2022}
}
@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

More Repositories

1. ViTPose (Python, 1,313 stars): The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
2. ViTDet (Python, 524 stars): Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object Detection"
3. ViTAE-Transformer-Remote-Sensing (TeX, 446 stars): A comprehensive list [SAMRS@NeurIPS'23, RVSA@TGRS'22, RSP@TGRS'22] of our research works related to remote sensing, including papers, codes, and citations. Note: The repo for [TGRS'22] "An Empirical Study of Remote Sensing Pretraining" has been moved to: https://github.com/ViTAE-Transformer/RSP
4. Remote-Sensing-RVSA (Python, 403 stars): The official repo for [TGRS'22] "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model"
5. SAMRS (Python, 263 stars): The official repo for [NeurIPS'23] "SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model"
6. ViTAE-Transformer (Python, 249 stars): The official repo for [NeurIPS'21] "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias" and [IJCV'22] "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond"
7. ViTAE-Transformer-Matting (TeX, 229 stars): A comprehensive list [AIM@IJCAI'21, P3M@MM'21, GFM@IJCV'22, RIM@CVPR'23, P3MNet@IJCV'23] of our research works related to image matting, including papers, codes, datasets, demos, and citations. Note: The repo for [IJCV'23] "Rethinking Portrait Matting with Privacy Preserving" has been moved to: https://github.com/ViTAE-Transformer/P3M-Net
8. QFormer (Python, 158 stars): The official repo for [TPAMI'23] "Vision Transformer with Quadrangle Attention"
9. MTP (Python, 140 stars): The official repo for [JSTARS'24] "MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining"
10. RSP (Python, 130 stars): The official repo for [TGRS'22] "An Empirical Study of Remote Sensing Pretraining"
11. P3M-Net (Python, 90 stars): The official repo for [IJCV'23] "Rethinking Portrait Matting with Privacy Preserving"
12. DeepSolo (Python, 68 stars): [CVPR 2023] "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting"
13. ViTAE-Transformer-Scene-Text-Detection (Python, 37 stars): The official repo for [IJCV'22] "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection"
14. LeMeViT (Python, 37 stars): The official repo for [IJCAI'24] "LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation"
15. SimDistill (Python, 22 stars): The official repo for [AAAI 2024] "SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection"
16. VOS-LLB (Python, 10 stars): The official repo for [AAAI'23] "Learning to Learn Better for Video Object Segmentation"
17. APTv2 (Python, 9 stars): The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
18. I3CL (Python, 2 stars): The official repo for [IJCV'22] "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection"