  • Stars: 152
  • Rank: 243,260 (Top 5%)
  • Language: Python
  • Created: over 2 years ago
  • Updated: over 1 year ago

Repository Details

The official repo for [ECCV'22] "VSA: Learning Varied-Size Window Attention in Vision Transformers"

Updates | Introduction | Statement

Current applications

Classification: Please see ViTAE-VSA for Image Classification for usage details;

Object Detection: Please see ViTAE-VSA for Object Detection for usage details;

Semantic Segmentation: Will be released in the next few days;

Other ViTAE applications

ViTAE & ViTAEv2: Please see ViTAE-Transformer for Image Classification, Object Detection, and Semantic Segmentation;

Matting: Please see ViTAE-Transformer for matting;

Remote Sensing: Please see ViTAE-Transformer for Remote Sensing;

Updates

19/09/2022

  • The code and training logs for ViTAE-VSA have been released! The semantic segmentation code and Swin+VSA will be released in the next few days.

09/07/2022

  • The paper has been accepted by ECCV'22!

19/04/2022

  • The paper is posted on arXiv! The code will be made publicly available once it is cleaned up.

Introduction

This repository contains the code, models, and test results for the paper VSA: Learning Varied-Size Window Attention in Vision Transformers. We design a novel varied-size window attention module that learns adaptive window configurations from data. By adopting VSA in each head independently, the model can capture long-range dependencies and rich contextual information from diverse windows. VSA can replace the window attention in state-of-the-art methods and facilitate learning on various vision tasks, including classification, detection, and segmentation.

Fig.1 - The comparison of the current design (hand-crafted windows) and VSA.

Fig.2 - The architecture of VSA.

Usage

If you are interested in using only the VSA attention module, please refer to this file in the classification code or the VSAWindowAttention class in the object detection code.
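
For readers who just want to see the mechanism, below is a minimal, self-contained PyTorch sketch of varied-size window attention. It is an illustrative approximation under simplifying assumptions, not the repository's VSAWindowAttention implementation: the class name ToyVariedSizeWindowAttention is made up, the window scale and offset are predicted once per window rather than per head, and the feature map is assumed to be divisible by the window size.

```python
# Illustrative sketch only (NOT the repository's VSAWindowAttention):
# per-window scale/offset are predicted from pooled window features,
# keys/values are resampled on the transformed windows, then standard
# window attention is applied.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVariedSizeWindowAttention(nn.Module):
    def __init__(self, dim=96, window_size=7, num_heads=4):
        super().__init__()
        self.ws, self.nh = window_size, num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, dim * 2, 1)
        # Pool each default window, then predict (scale_x, scale_y, offset_x, offset_y).
        self.pred = nn.Sequential(
            nn.AvgPool2d(window_size, stride=window_size),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(dim, 4, 1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window_size
        B, C, H, W = x.shape
        ws, nh = self.ws, self.nh
        nwh, nww = H // ws, W // ws  # number of windows along each axis
        q, kv = self.q(x), self.kv(x)

        # Base sampling grid in [-1, 1], grouped into the default windows.
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), -1)  # (H, W, 2) in (x, y) order
        base = base.view(nwh, ws, nww, ws, 2).permute(0, 2, 1, 3, 4)  # (nwh, nww, ws, ws, 2)
        centre = base.mean(dim=(2, 3), keepdim=True)

        # Predicted window transform; scale starts near 1 and offset near 0.
        t = self.pred(x).permute(0, 2, 3, 1)  # (B, nwh, nww, 4)
        scl = 1.0 + t[..., :2].reshape(B, nwh, nww, 1, 1, 2)
        off = t[..., 2:].reshape(B, nwh, nww, 1, 1, 2)
        grid = (base - centre) * scl + centre + off  # varied-size windows
        grid = grid.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 2)

        # Resample keys/values inside the transformed windows.
        kv = F.grid_sample(kv, grid, align_corners=True)
        k, v = kv.chunk(2, dim=1)

        def to_windows(feat):  # (B, C', H, W) -> (B * num_windows, heads, ws*ws, head_dim)
            c = feat.shape[1]
            feat = feat.reshape(B, c, nwh, ws, nww, ws).permute(0, 2, 4, 1, 3, 5)
            return feat.reshape(B * nwh * nww, nh, c // nh, ws * ws).transpose(-2, -1)

        q, k, v = to_windows(q), to_windows(k), to_windows(v)
        attn = ((q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v  # (B * num_windows, heads, ws*ws, head_dim)
        out = out.transpose(-2, -1).reshape(B, nwh, nww, C, ws, ws)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)
        return self.proj(out)


if __name__ == "__main__":
    attn = ToyVariedSizeWindowAttention(dim=96, window_size=7, num_heads=4)
    print(attn(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

The sketch mirrors the general idea described in the paper: queries come from the default windows, while keys and values are re-sampled from learned, scaled and shifted windows before standard window attention is applied. Refer to the linked files for the actual implementation.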

Classification Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Main Results on ImageNet-1K with pretrained models

| Name | Resolution | Acc@1 | Acc@5 | Acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| Swin-T | 224x224 | 81.2 | \ | \ | \ |
| Swin-T+VSA | 224x224 | 82.24 | 95.8 | \ | Coming Soon |
| ViTAEv2*-S | 224x224 | 82.2 | 96.1 | 87.5 | \ |
| ViTAEv2-S | 224x224 | 82.6 | 96.2 | 87.6 | weights&logs |
| ViTAEv2*-S+VSA | 224x224 | 82.7 | 96.3 | 87.7 | weights&logs |
| Swin-S | 224x224 | 83.0 | \ | \ | \ |
| Swin-S+VSA | 224x224 | 83.6 | 96.6 | \ | Coming Soon |
| ViTAEv2*-48M+VSA | 224x224 | 83.9 | 96.6 | \ | weights&logs |

Models with ImageNet-22K pretraining

| Name | Resolution | Acc@1 | Acc@5 | Acc@RealTop-1 | Pretrained |
| --- | --- | --- | --- | --- | --- |
| ViTAEv2*-48M+VSA | 224x224 | 84.9 | 97.4 | \ | Coming Soon |
| ViTAEv2*-B+VSA | 224x224 | 86.2 | 97.9 | 90.0 | Coming Soon |

Object Detection Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTAEv2*-S | ImageNet-1K | 1x | 43.5 | 39.4 | 37M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 46.3 | 41.8 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 45.9 | 41.4 | 37M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 44.7 | 40.0 | 39M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 47.8 | 42.6 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 48.1 | 42.9 | 39M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 49.9 | 44.2 | 69M | config | github | Coming Soon |

Cascade Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTAEv2*-S | ImageNet-1K | 1x | 47.3 | 40.6 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 50.6 | 43.6 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 49.8 | 43.0 | 77M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 48.0 | 41.3 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 51.4 | 44.5 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 51.9 | 44.8 | 77M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 52.9 | 45.6 | 108M | config | github | Coming Soon |

Semantic Segmentation Results for Cityscapes

ViTAEv2* denotes the variant that uses window attention in all stages.

UperNet

512x1024 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | mIoU* | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 40k | 78.9 | 79.9 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 40k | 80.8 | 81.7 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 40k | 80.1 | 80.9 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 40k | 81.4 | 82.3 | \ | \ | \ | \ |
| Swin-T | ImageNet-1K | 80k | 79.3 | 80.2 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 80k | 81.6 | 82.4 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 80k | 80.8 | 81.0 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 80k | 82.2 | 83.0 | \ | \ | \ | \ |

769x769 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | ms mIoU | #params | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 40k | 79.3 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 40k | 81.0 | 81.9 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1K | 40k | 79.6 | 80.6 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1K | 40k | 81.7 | 82.5 | \ | \ | \ | \ |
| Swin-T | ImageNet-1K | 80k | 79.6 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1K | 80k | 81.6 | 82.5 | \ | \ | \ | \ |

Please refer to our paper for more experimental results.

Statement

This project is for research purposes only. For any other questions, please contact qmzhangzz at hotmail.com or yufei.xu at outlook.com.

The codebase borrows from T2T, ViTAEv2, and Swin.

Citing VSA and ViTAE

@article{zhang2022vsa,
  title={VSA: Learning Varied-Size Window Attention in Vision Transformers},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2204.08446},
  year={2022}
}
@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

More Repositories

1. ViTPose (Python, 1,313 stars): The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
2. ViTDet (Python, 524 stars): Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object Detection"
3. ViTAE-Transformer-Remote-Sensing (TeX, 446 stars): A comprehensive list [SAMRS@NeurIPS'23, RVSA@TGRS'22, RSP@TGRS'22] of our research works related to remote sensing, including papers, codes, and citations. Note: The repo for [TGRS'22] "An Empirical Study of Remote Sensing Pretraining" has been moved to: https://github.com/ViTAE-Transformer/RSP
4. Remote-Sensing-RVSA (Python, 403 stars): The official repo for [TGRS'22] "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model"
5. SAMRS (Python, 263 stars): The official repo for [NeurIPS'23] "SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model"
6. ViTAE-Transformer (Python, 249 stars): The official repo for [NeurIPS'21] "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias" and [IJCV'22] "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond"
7. ViTAE-Transformer-Matting (TeX, 229 stars): A comprehensive list [AIM@IJCAI'21, P3M@MM'21, GFM@IJCV'22, RIM@CVPR'23, P3MNet@IJCV'23] of our research works related to image matting, including papers, codes, datasets, demos, and citations. Note: The repo for [IJCV'23] "Rethinking Portrait Matting with Privacy Preserving" has been moved to: https://github.com/ViTAE-Transformer/P3M-Net
8. QFormer (Python, 158 stars): The official repo for [TPAMI'23] "Vision Transformer with Quadrangle Attention"
9. MTP (Python, 140 stars): The official repo for [JSTARS'24] "MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining"
10. RSP (Python, 130 stars): The official repo for [TGRS'22] "An Empirical Study of Remote Sensing Pretraining"
11. P3M-Net (Python, 90 stars): The official repo for [IJCV'23] "Rethinking Portrait Matting with Privacy Preserving"
12. DeepSolo (Python, 68 stars): [CVPR 2023] "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting"
13. ViTAE-Transformer-Scene-Text-Detection (Python, 37 stars): The official repo for [IJCV'22] "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection"
14. LeMeViT (Python, 37 stars): The official repo for [IJCAI'24] "LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation"
15. SimDistill (Python, 22 stars): The official repo for [AAAI 2024] "SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection"
16. VOS-LLB (Python, 10 stars): The official repo for [AAAI'23] "Learning to Learn Better for Video Object Segmentation"
17. APTv2 (Python, 9 stars): The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
18. I3CL (Python, 2 stars): The official repo for [IJCV'22] "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection"