Awesome Visual-Transformer
A collection of papers on Transformers in computer vision (CV).
If you find an overlooked paper, please open an issue or, preferably, a pull request.
Papers
Transformer original paper
- Attention is All You Need (NIPS 2017)
Technical blog
- [English Blog] Transformers in Vision [Link]
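- [Chinese Blog] 3W字长文带你轻松入门视觉transformer (A 30,000-Character Easy Introduction to Vision Transformers) [Link]
- [Chinese Blog] Vision Transformer 超详细解读 (原理分析+代码解读) (Vision Transformer Explained in Detail: Principles and Code Walkthrough) [Link]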
Survey
- Multimodal learning with transformers: A survey (IEEE TPAMI) [paper] - 2023.05.11
- A Survey of Visual Transformers [paper] - 2021.11.30
- Transformers in Vision: A Survey [paper] - 2021.02.22
- A Survey on Visual Transformer [paper] - 2021.01.30
- A Survey of Transformers [paper] - 2020.06.09
arXiv papers
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [paper]
- [FocusedDecoder] Focused Decoding Enables 3D Anatomical Detection by Transformers [paper] [code]
- [TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [paper] [code]
- [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [paper] [code]
- [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
- [MViTv2] Improved Multiscale Vision Transformers for Classification and Detection [paper] [code]
- DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [paper] [code]
- Three things everyone should know about Vision Transformers [paper]
- [DeiT III] DeiT III: Revenge of the ViT [paper]
- [DaViT] DaViT: Dual Attention Vision Transformers [paper] [code]
- [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
- [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
- [MaxViT] MaxViT: Multi-Axis Vision Transformer [paper]
- [V2X-ViT] V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [paper]
- [MemMC-MAE] Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [paper] [code]
- Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [paper] [code]
- [VideoMAE] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [paper] [code]
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [paper]
- ResViT: Residual vision transformers for multi-modal medical image synthesis [paper]
- [CrossEfficientViT] Combining EfficientNet and Vision Transformers for Video Deepfake Detection [paper] [code]
- [Discrete ViT] Discrete Representations Strengthen Vision Transformer Robustness [paper]
- [StyleSwin] StyleSwin: Transformer-based GAN for High-resolution Image Generation [paper] [code]
- [SReT] Sliced Recursive Transformer [paper] [code]
- Dynamic Token Normalization Improves Vision Transformer [paper]
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [paper] [code]
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [paper]
- [ORViT] Object-Region Video Transformers [paper] [code]
- Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [paper] [code]
- [NViT] NViT: Vision Transformer Compression and Parameter Redistribution [paper]
- 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [paper]
- Adversarial Token Attacks on Vision Transformers [paper]
- Contextual Transformer Networks for Visual Recognition [paper] [code]
- [TranSalNet] TranSalNet: Visual saliency prediction using transformers [paper]
- [MobileViT] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [paper]
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [paper]
- [3D-Transformer] 3D-Transformer: Molecular Representation with Transformer in 3D Space [paper]
- [CCTrans] CCTrans: Simplifying and Improving Crowd Counting with Transformer [paper]
- [UFO-ViT] UFO-ViT: High Performance Linear Vision Transformer without Softmax [paper]
- Sparse Spatial Transformers for Few-Shot Learning [paper]
- Vision Transformer Hashing for Image Retrieval [paper]
- [OH-Former] OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [paper]
- [Pix2seq] Pix2seq: A Language Modeling Framework for Object Detection [paper]
- [CoAtNet] CoAtNet: Marrying Convolution and Attention for All Data Sizes [paper]
- [LOTR] LOTR: Face Landmark Localization Using Localization Transformer [paper]
- Transformer-Unet: Raw Image Processing with Unet [paper]
- [GraFormer] GraFormer: Graph Convolution Transformer for 3D Pose Estimation [paper]
- [CDTrans] CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [paper]
- PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [paper] [code]
- Anchor DETR: Query Design for Transformer-Based Detector [paper] [code]
- [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]
- [ESRT] Efficient Transformer for Single Image Super-Resolution [paper]
- [MaskFormer] MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [paper] [code]
- [SwinIR] SwinIR: Image Restoration Using Swin Transformer [paper] [code]
- [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [paper]
- Do Vision Transformers See Like Convolutional Neural Networks? [paper]
- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [paper]
- Light Field Image Super-Resolution with Transformers [paper] [code]
- Focal Self-attention for Local-Global Interactions in Vision Transformers [paper] [code]
- Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [paper] [code]
- Mobile-Former: Bridging MobileNet and Transformer [paper]
- [TriTransNet] TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [paper]
- [PSViT] PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [paper]
- Boosting Few-shot Semantic Segmentation with Transformers [paper] [code]
- Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [paper]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [paper]
- [Styleformer] Styleformer: Transformer based Generative Adversarial Networks with Style Vector [paper] [code]
- [CMT] CMT: Convolutional Neural Networks Meet Vision Transformers [paper]
- [TransAttUnet] TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [paper]
- TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [paper]
- [ViTGAN] ViTGAN: Training GANs with Vision Transformers [paper]
- What Makes for Hierarchical Vision Transformer? [paper]
- [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [paper]
- [FFVT] Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [paper]
- [TransformerFusion] TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
- Escaping the Big Data Paradigm with Compact Transformers [paper]
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [paper]
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [paper]
- [XCiT] XCiT: Cross-Covariance Image Transformers [paper] [code]
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [paper] [code]
- Video Swin Transformer [paper] [code]
- [VOLO] VOLO: Vision Outlooker for Visual Recognition [paper] [code]
- Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [paper]
- End-to-end Temporal Action Detection with Transformer [paper] [code]
- Efficient Self-supervised Vision Transformers for Representation Learning [paper]
- Space-time Mixing Attention for Video Transformer [paper]
- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [paper]
- [CAT] CAT: Cross Attention in Vision Transformer [paper]
- Scaling Vision Transformers [paper]
- [DETReg] DETReg: Unsupervised Pretraining with Region Priors for Object Detection [paper] [code]
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [paper]
- [MViT] MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [paper]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [paper]
- On Improving Adversarial Transferability of Vision Transformers [paper]
- Fully Transformer Networks for Semantic Image Segmentation [paper]
- Visual Transformer for Task-aware Active Learning [paper] [code]
- Efficient Training of Visual Transformers with Small-Size Datasets [paper]
- Reveal of Vision Transformers Robustness against Adversarial Attacks [paper]
- Person Re-Identification with a Locally Aware Transformer [paper]
- [Refiner] Refiner: Refining Self-attention for Vision Transformers [paper]
- [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [paper]
- Video Instance Segmentation using Inter-Frame Communication Transformers [paper]
- Transformer in Convolutional Neural Networks [paper] [code]
- [Uformer] Uformer: A General U-Shaped Transformer for Image Restoration [paper] [code]
- Patch Slimming for Efficient Vision Transformers [paper]
- [RegionViT] RegionViT: Regional-to-Local Attention for Vision Transformers [paper]
- Associating Objects with Transformers for Video Object Segmentation [paper] [code]
- Few-Shot Segmentation via Cycle-Consistent Transformer [paper]
- Glance-and-Gaze Vision Transformer [paper] [code]
- Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [paper]
- [DynamicViT] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper] [code]
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [paper] [code]
- Unsupervised Out-of-Domain Detection via Pre-trained Transformers [paper]
- [TransMIL] TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
- [TransVOS] TransVOS: Video Object Segmentation with Transformers [paper]
- [KVT] KVT: k-NN Attention for Boosting Vision Transformers [paper]
- [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [paper] [code]
- [SegFormer] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper] [code]
- [SDNet] SDNet: multi-branch for single image deraining using swin [paper] [code]
- [DVT] Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [paper]
- [GazeTR] Gaze Estimation using Transformer [paper] [code]
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [paper]
- Less is More: Pay Less Attention in Vision Transformers [paper]
- [FoveaTer] FoveaTer: Foveated Transformer for Image Classification [paper]
- [TransDA] Transformer-Based Source-Free Domain Adaptation [paper] [code]
- An Attention Free Transformer [paper]
- [PTNet] PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [paper]
- [ResT] ResT: An Efficient Transformer for Visual Recognition [paper] [code]
- [CogView] CogView: Mastering Text-to-Image Generation via Transformers [paper]
- [NesT] Aggregating Nested Transformers [paper]
- [TAPG] Temporal Action Proposal Generation with Transformers [paper]
- Boosting Crowd Counting with Transformers [paper]
- [COTR] COTR: Convolution in Transformer Network for End to End Polyp Detection [paper]
- [TransVOD] End-to-End Video Object Detection with Spatial-Temporal Transformers [paper] [code]
- Intriguing Properties of Vision Transformers [paper] [code]
- Combining Transformer Generators with Convolutional Discriminators [paper]
- Rethinking the Design Principles of Robust Vision Transformer [paper]
- Vision Transformers are Robust Learners [paper] [code]
- Manipulation Detection in Satellite Images Using Vision Transformer [paper]
- [Swin-Unet] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [paper] [code]
- Self-Supervised Learning with Swin Transformers [paper] [code]
- [SCTN] SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [paper]
- [RelationTrack] RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [paper]
- [VGTR] Visual Grounding with Transformers [paper]
- [PST] Visual Composite Set Detection Using Part-and-Sum Transformers [paper]
- [TrTr] TrTr: Visual Tracking with Transformer [paper] [code]
- [MOTR] MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper] [code]
- Attention for Image Registration (AiR): an unsupervised Transformer approach [paper]
- [TransHash] TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [paper]
- [ISTR] ISTR: End-to-End Instance Segmentation with Transformers [paper] [code]
- [CAT] CAT: Cross-Attention Transformer for One-Shot Object Detection [paper]
- [CoSformer] CoSformer: Detecting Co-Salient Object with Transformers [paper]
- End-to-End Attention-based Image Captioning [paper]
- [PMTrans] Pyramid Medical Transformer for Medical Image Segmentation [paper]
- [HandsFormer] HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [paper]
- [GasHis-Transformer] GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [paper]
- Emerging Properties in Self-Supervised Vision Transformers [paper]
- [InTra] Inpainting Transformer for Anomaly Detection [paper]
- [Twins] Twins: Revisiting Spatial Attention Design in Vision Transformers [paper] [code]
- [MLMSPT] Point Cloud Learning with Transformer [paper]
- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [paper]
- [ConTNet] ConTNet: Why not use convolution and transformer at the same time? [paper] [code]
- [DTNet] Dual Transformer for Point Cloud Analysis [paper]
- Improve Vision Transformers Training by Suppressing Over-smoothing [paper] [code]
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [paper]
- [M3DeTR] M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [paper] [code]
- [Skeletor] Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [paper]
- [FaceT] Learning to Cluster Faces via Transformer [paper]
- [MViT] Multiscale Vision Transformers [paper] [code]
- [VATT] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [paper]
- [So-ViT] So-ViT: Mind Visual Tokens for Vision Transformer [paper] [code]
- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [paper] [code]
- [TransRPPG] TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [paper]
- [VideoGPT] VideoGPT: Video Generation using VQ-VAE and Transformers [paper]
- [M2TR] M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [paper]
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [paper]
- [TransCrowd] TransCrowd: Weakly-Supervised Crowd Counting with Transformer [paper] [code]
- Visual Transformer Pruning [paper]
- Self-supervised Video Retrieval Transformer Network [paper]
- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [paper]
- [TransGAN] TransGAN: Two Transformers Can Make One Strong GAN [paper] [code]
- Geometry-Free View Synthesis: Transformers and no 3D Priors [paper] [code]
- [CoaT] Co-Scale Conv-Attentional Image Transformers [paper] [code]
- [LocalViT] LocalViT: Bringing Locality to Vision Transformers [paper] [code]
- [CIT] Cloth Interactive Transformer for Virtual Try-On [paper] [code]
- Handwriting Transformers [paper]
- [SiT] SiT: Self-supervised vIsion Transformer [paper] [code]
- On the Robustness of Vision Transformers to Adversarial Examples [paper]
- An Empirical Study of Training Self-Supervised Visual Transformers [paper]
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [paper]
- [AOT-GAN] Aggregated Contextual Transformations for High-Resolution Image Inpainting [paper] [code]
- Deepfake Detection Scheme Based on Vision Transformer and Distillation [paper]
- [ATAG] Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [paper]
- [TubeR] TubeR: Tube-Transformer for Action Detection [paper]
- [AAformer] AAformer: Auto-Aligned Transformer for Person Re-Identification [paper]
- [TFill] TFill: Image Completion via a Transformer-Based Architecture [paper]
- Group-Free 3D Object Detection via Transformers [paper] [code]
- [STGT] Spatial-Temporal Graph Transformer for Multiple Object Tracking [paper]
- Going deeper with Image Transformers [paper]
- [Meta-DETR] Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [paper] [code]
- [DA-DETR] DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [paper]
- Robust Facial Expression Recognition with Convolutional Visual Transformers [paper]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
- Spatiotemporal Transformer for Video-based Person Re-identification [paper]
- [TransUNet] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [paper] [code]
- [CvT] CvT: Introducing Convolutions to Vision Transformers [paper] [code]
- [TFPose] TFPose: Direct Human Pose Estimation with Transformers [paper]
- [TransCenter] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [paper]
- Face Transformer for Recognition [paper]
- On the Adversarial Robustness of Visual Transformers [paper]
- Understanding Robustness of Transformers for Image Classification [paper]
- Lifting Transformer for 3D Human Pose Estimation in Video [paper]
- [GSA-Net] Global Self-Attention Networks for Image Recognition [paper]
- High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
- [DPT] Vision Transformers for Dense Prediction [paper] [code]
- [TransFG] TransFG: A Transformer Architecture for Fine-grained Recognition [paper]
- [TimeSformer] Is Space-Time Attention All You Need for Video Understanding? [paper]
- Multi-view 3D Reconstruction with Transformer [paper]
- Can Vision Transformers Learn without Natural Images? [paper] [code]
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [paper]
- Instance-level Image Retrieval using Reranking Transformers [paper] [code]
- [BossNAS] BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [paper] [code]
- [CeiT] Incorporating Convolution Designs into Visual Transformers [paper]
- [DeepViT] DeepViT: Towards Deeper Vision Transformer [paper]
- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [paper]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [paper] [code]
- [UNETR] UNETR: Transformers for 3D Medical Image Segmentation [paper]
- Scalable Visual Transformers with Hierarchical Pooling [paper]
- [ConViT] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [paper]
- [TransMed] TransMed: Transformers Advance Multi-modal Medical Image Classification [paper]
- [U-Transformer] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [paper]
- [SpecTr] SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [paper] [code]
- [TransBTS] TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [paper] [code]
- [SSTN] SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [paper]
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [paper] [code]
- [CPVT] Do We Really Need Explicit Position Encodings for Vision Transformers? [paper] [code]
- Deepfake Video Detection Using Convolutional Vision Transformer [paper]
- Training Vision Transformers for Image Retrieval [paper]
- [VTN] Video Transformer Network [paper]
- [BoTNet] Bottleneck Transformers for Visual Recognition [paper]
- [CPTR] CPTR: Full Transformer Network for Image Captioning [paper]
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [paper] [code]
- [Trans2Seg] Segmenting Transparent Object in the Wild with Transformer [paper] [code]
- Investigating the Vision Transformer Model for Image Retrieval Tasks [paper]
- [Trear] Trear: Transformer-based RGB-D Egocentric Action Recognition [paper]
- [VisualSparta] VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [paper]
- [TrackFormer] TrackFormer: Multi-Object Tracking with Transformers [paper]
- [TAPE] Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [paper]
- [TRIQ] Transformer for Image Quality Assessment [paper] [code]
- [TransTrack] TransTrack: Multiple-Object Tracking with Transformer [paper] [code]
- [DeiT] Training data-efficient image transformers & distillation through attention [paper] [code]
- [Pointformer] 3D Object Detection with Pointformer [paper]
- [ViT-FRCNN] Toward Transformer-Based Object Detection [paper]
- [Taming-transformers] Taming Transformers for High-Resolution Image Synthesis [paper] [code]
- [SceneFormer] SceneFormer: Indoor Scene Generation with Transformers [paper]
- [PCT] PCT: Point Cloud Transformer [paper]
- [PED] DETR for Pedestrian Detection [paper]
- [C-Tran] General Multi-label Image Classification with Transformers [paper]
2022
TPAMI
- [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding [paper]
ECCV
- [X-CLIP] Expanding Language-Image Pretrained Models for General Video Recognition [paper] [code]
- [TinyViT] TinyViT: Fast Pretraining Distillation for Small Vision Transformers [paper] [code]
- [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
- [AiATrack] AiATrack: Attention in Attention for Transformer Visual Tracking [paper] [code]
- [OSTrack] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [paper] [code]
- [Unicorn] Towards Grand Unification of Object Tracking [paper] [code]
- [P3AFormer] Tracking Objects as Pixel-wise Distributions [paper] [code]
CVPR
- [MAE] Masked Autoencoders Are Scalable Vision Learners [paper] [code]
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [paper] [code]
- Fast Point Transformer [paper]
- EDTER: Edge Detection With Transformer [paper] [code]
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [paper]
- MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [paper]
- HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [paper] [code]
- Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [paper]
- MPViT: Multi-Path Vision Transformer for Dense Prediction [paper] [code]
- A-ViT: Adaptive Tokens for Efficient Vision Transformer [paper]
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [paper] [code]
- Continual Learning With Lifelong Vision Transformer [paper]
- Swin Transformer V2: Scaling Up Capacity and Resolution [paper] [code]
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [paper] [code]
- Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [paper]
- Human-Object Interaction Detection via Disentangled Transformer [paper]
- LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [paper]
- Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [paper]
- Vision Transformer With Deformable Attention [paper]
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [paper]
- [Restormer] Restormer: Efficient Transformer for High-Resolution Image Restoration [paper] [code]
- [SAM-DETR] Accelerating DETR Convergence via Semantic-Aligned Matching [paper] [code]
- [BEVT] BEVT: BERT Pretraining of Video Transformers [paper] [code]
- [MobileFormer] Mobile-Former: Bridging MobileNet and Transformer [paper]
- [STRM] Spatio-temporal Relation Modeling for Few-shot Action Recognition [paper] [code]
- [MiniViT] MiniViT: Compressing Vision Transformers with Weight Multiplexing [paper] [code]
- [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
- [DW-ViT] Beyond Fixation: Dynamic Window Visual Transformer [paper] [code]
- [TokenFusion] Multimodal Token Fusion for Vision Transformers [paper]
- [CMT] Convolutional Neural Networks Meet Vision Transformers [paper]
- Fine-tuning Image Transformers using Learnable Memory [paper]
- [TransMix] Attend to Mix for Vision Transformers [paper] [code]
- [NomMer] Nominate Synergistic Context in Vision Transformer for Visual Recognition [paper] [code]
- [SSA] Shunted Self-Attention via Multi-Scale Token Aggregation [paper] [code]
- [RVT] Towards Robust Vision Transformer [paper] [code]
- [LVT] Lite Vision Transformer with Enhanced Self-Attention [paper] [code]
- [StyTr2] StyTr2: Image Style Transfer with Transformers [paper] [code]
WACV
ICLR
- [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
- [CrossFormer] CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [paper] [code]
- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [paper] [code]
- [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]
2021
NeurIPS
- ProTo: Program-Guided Transformer for Program-Guided Tasks [paper] [code]
- [Augvit] Augmented Shortcuts for Vision Transformers [paper] [code]
- [YOLOS] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper] [code]
- [CATs] Semantic Correspondence with Transformers [paper] [code]
- [Moment-DETR] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [paper] [code]
- Dual-stream Network for Visual Recognition [paper] [code]
- [Container] Container: Context Aggregation Network [paper] [code]
- [TNT] Transformer in Transformer [paper] [code]
- T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [paper]
- Long Short-Term Transformer for Online Action Detection [paper]
- TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
- TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [paper]
- TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
- Associating Objects with Transformers for Video Object Segmentation [paper]
- Test-Time Personalization with a Transformer for Human Pose Estimation [paper]
- Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [paper]
- Dynamic Grained Encoder for Vision Transformers [paper]
- HRFormer: High-Resolution Vision Transformer for Dense Prediction [paper]
- Searching the Search Space of Vision Transformer [paper]
- Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [paper]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper]
- Do Vision Transformers See Like Convolutional Neural Networks? [paper]
- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [paper]
- Glance-and-Gaze Vision Transformer [paper]
- MST: Masked Self-Supervised Transformer for Visual Representation [paper]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper]
- TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [paper]
- Improved Transformer for High-Resolution GANs [paper]
- All Tokens Matter: Token Labeling for Training Better Vision Transformers [paper]
- XCiT: Cross-Covariance Image Transformers [paper]
- Efficient Training of Visual Transformers with Small Datasets [paper]
ICCV
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Marr Prize) [paper] [code]
- [ICT] High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
- [PoinTr] PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (oral) [paper] [code]
- [STTR] Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [paper] [code]
- [TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection [paper]
- Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (oral) [paper] [code]
- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [paper]
- [T2T-ViT] Training Vision Transformers from Scratch on ImageNet [paper] [code]
- [THUNDR] THUNDR: Transformer-Based 3D Human Reconstruction With Markers [paper]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper]
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [paper] [code]
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [paper]
- [GLiT] GLiT: Neural Architecture Search for Global and Local Image Transformer [paper]
- [TRAR] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [paper]
- [UniT] UniT: Multimodal Multitask Learning With a Unified Transformer [paper] [code]
- Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [paper]
- Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [paper]
- [LocalTrans] LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [paper]
- A Latent Transformer for Disentangled Face Editing in Images and Videos [paper] [code]
- [GroupFormer] GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [paper]
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [paper]
- [WB-DETR] WB-DETR: Transformer-Based Detector Without Backbone [paper]
- The Animation Transformer: Visual Correspondence via Segment Matching [paper]
- Relaxed Transformer Decoders for Direct Action Proposal Generation [paper]
- [PPT-Net] Pyramid Point Cloud Transformer for Large-Scale Place Recognition [paper] [code]
- Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [paper]
- Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [paper]
- Image Harmonization With Transformer [paper] [code]
- [COTR] COTR: Correspondence Transformer for Matching Across Images [paper]
- [MUSIQ] MUSIQ: Multi-Scale Image Quality Transformer [paper]
- Episodic Transformer for Vision-and-Language Navigation [paper]
- [CrackFormer] CrackFormer: Transformer Network for Fine-Grained Crack Detection [paper]
- [HiT] HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [paper]
- Event-Based Video Reconstruction Using Transformer [paper]
- [STVGBert] STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [paper]
- [HiFT] HiFT: Hierarchical Feature Transformer for Aerial Tracking [paper] [code]
- [DocFormer] DocFormer: End-to-End Transformer for Document Understanding [paper]
- [LeViT] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [paper] [code]
- [SignBERT] SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [paper]
- [VidTr] VidTr: Video Transformer Without Convolutions [paper]
- [ACTOR] Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [paper]
- [Segmenter] Segmenter: Transformer for Semantic Segmentation [paper] [code]
- [Visformer] Visformer: The Vision-friendly Transformer [paper] [code]
- [PnP-DETR] PnP-DETR: Towards Efficient Visual Analysis with Transformers (ICCV) [paper] [code]
- [VoTr] Voxel Transformer for 3D Object Detection [paper]
- [TransVG] TransVG: End-to-End Visual Grounding with Transformers [paper]
- [3DETR] An End-to-End Transformer Model for 3D Object Detection [paper] [code]
- [Eformer] Eformer: Edge Enhancement based Transformer for Medical Image Denoising [paper]
- [TransFER] TransFER: Learning Relation-aware Facial Expression Representations with Transformers [paper]
- [Oriented RCNN] Oriented Object Detection with Transformer [paper]
- [ViViT] ViViT: A Video Vision Transformer [paper]
- [Stark] Learning Spatio-Temporal Transformer for Visual Tracking [paper] [code]
- [CT3D] Improving 3D Object Detection with Channel-wise Transformer [paper]
- [VST] Visual Saliency Transformer [paper]
- [PiT] Rethinking Spatial Dimensions of Vision Transformers [paper] [code]
- [CrossViT] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [paper] [code]
- [PointTransformer] Point Transformer [paper]
- [TS-CAM] TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [paper] [code]
- [VTs] Visual Transformers: Token-based Image Representation and Processing for Computer Vision [paper]
- [TransDepth] Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [paper] [code]
- [Conditional DETR] Conditional DETR for Fast Training Convergence [paper] [code]
- [PIT] PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [paper] [code]
- [SOTR] SOTR: Segmenting Objects with Transformers [paper] [code]
- [SnowflakeNet] SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [paper] [code]
- [TransPose] TransPose: Keypoint Localization via Transformer [paper] [code]
- [TransReID] TransReID: Transformer-based Object Re-Identification [paper] [code]
- [CWT] Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [paper] [code]
- Anticipative Video Transformer [paper] [code]
- Rethinking and Improving Relative Position Encoding for Vision Transformer [paper] [code]
- Vision Transformer with Progressive Sampling [paper] [code]
- [SMCA] Fast Convergence of DETR with Spatially Modulated Co-Attention [paper] [code]
- [AutoFormer] AutoFormer: Searching Transformers for Visual Recognition [paper] [code]
CVPR
- Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [paper]
- [HOTR] HOTR: End-to-End Human-Object Interaction Detection with Transformers (oral) [paper]
- [METRO] End-to-End Human Pose and Mesh Reconstruction with Transformers [paper]
- [LETR] Line Segment Detection Using Transformers without Edges [paper]
- [TransFuser] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [paper] [code]
- Pose Recognition with Cascade Transformers [paper]
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [paper]
- [LoFTR] LoFTR: Detector-Free Local Feature Matching with Transformers [paper] [code]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
- [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [paper] [code]
- [TransT] Transformer Tracking [paper] [code]
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (oral) [paper]
- [VisTR] End-to-End Video Instance Segmentation with Transformers [paper]
- Transformer Interpretability Beyond Attention Visualization [paper] [code]
- [IPT] Pre-Trained Image Processing Transformer [paper]
- [UP-DETR] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper]
- [IQT] Perceptual Image Quality Assessment with Transformers (workshop) [paper]
- High-Resolution Complex Scene Synthesis with Transformers (workshop) [paper]
- [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
ICML
- Generative Video Transformer: Can Objects be the Words? [paper]
- [GANsformer] Generative Adversarial Transformers [paper] [code]
ICRA
- [NDT-Transformer] NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [paper]
ICLR
- [VTNet] VTNet: Visual Transformer Network for Object Goal Navigation [paper]
- [Vision Transformer] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [paper] [code]
- [Deformable DETR] Deformable DETR: Deformable Transformers for End-to-End Object Detection [paper] [code]
- [LambdaNetworks] Modeling Long-Range Interactions Without Attention [paper] [code]
ACM MM
- Video Transformer for Deepfake Detection with Incremental Learning [paper]
- [HAT] HAT: Hierarchical Aggregation Transformers for Person Re-identification [paper]
- Token Shift Transformer for Video Classification [paper] [code]
- [DPT] DPT: Deformable Patch-based Transformer for Visual Recognition [paper] [code]
MICCAI
- [UTNet] UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [paper] [code]
- [MedT] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [paper] [code]
- [MCTrans] Multi-Compound Transformer for Accurate Biomedical Image Segmentation [paper] [code]
- [PNS-Net] Progressively Normalized Self-Attention Network for Video Polyp Segmentation [paper] [code]
- [MBT-Net] A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [paper]
BMVC
- [ACT] End-to-End Object Detection with Adaptive Clustering Transformer [paper]
- [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
- [TransFusion] TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [paper] [code]
ISIE
- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (ISIE) [paper]
CoRL
- [DETR3D] DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [paper]
IJCAI
- Medical Image Segmentation using Squeeze-and-Expansion Transformers [paper]
IROS
- [YOGO] You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module (IROS) [paper] [code]
- [PTT] PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [paper] [code]
WACV
ICDAR
- Vision Transformer for Fast and Efficient Scene Text Recognition [paper]
2020
- [DETR] End-to-End Object Detection with Transformers (ECCV) [paper] [code]
- [FPT] Feature Pyramid Transformer (CVPR) [paper] [code]
Other resource
Acknowledgement
Thanks to Awesome-Crowd-Counting for the template.