Awesome Visual-Transformer

A collection of Transformer papers in computer vision (CV).

If you find an overlooked paper, please open an issue or a pull request (pull requests preferred).

Papers

Original Transformer paper
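
The original Transformer paper is Attention Is All You Need (Vaswani et al., NeurIPS 2017). Its core operation, scaled dot-product attention, is reproduced below for quick reference:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension.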

Technical blog

  • [English Blog] Transformers in Vision [Link]
  • [Chinese Blog] A 30,000-word long-form introduction to vision Transformers [Link]
  • [Chinese Blog] An in-depth guide to Vision Transformer (theory and code walkthrough) [Link]
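
For readers working through the posts above, the sketch below shows the basic Vision Transformer pipeline they describe: split the image into patches, linearly embed them, prepend a class token, add positional embeddings, and run a standard Transformer encoder. It is a minimal illustrative example built from standard PyTorch modules; the class name, hyperparameters, and defaults are assumptions for illustration, not taken from any of the linked posts or papers.

```python
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + Transformer encoder (illustrative only)."""
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided conv == linear projection of flattened patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                         # x: (B, 3, H, W)
        x = self.patch_embed(x)                   # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                       # stacked self-attention blocks
        return self.head(x[:, 0])                 # classify from the class token

logits = ToyViT()(torch.randn(2, 3, 224, 224))    # -> shape (2, 1000)
```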

Survey

  • Multimodal learning with transformers: A survey (IEEE TPAMI) [paper] - 2023.05.11
  • A Survey of Visual Transformers [paper] - 2021.11.30
  • Transformers in Vision: A Survey [paper] - 2021.02.22
  • A Survey on Visual Transformer [paper] - 2021.01.30
  • A Survey of Transformers [paper] - 2020.06.09

arXiv papers

  • Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [paper]
  • [FocusedDecoder] Focused Decoding Enables 3D Anatomical Detection by Transformers [paper] [code]
  • [TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [paper] [code]
  • [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
  • BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [paper] [code]
  • [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
  • [MViTv2] Improved Multiscale Vision Transformers for Classification and Detection [paper] [code]
  • DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [paper] [code]
  • Three things everyone should know about Vision Transformers [paper]
  • [DeiT III] DeiT III: Revenge of the ViT [paper]
  • [DaViT] DaViT: Dual Attention Vision Transformers [paper] [code]
  • [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
  • [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
  • [MaxViT] MaxViT: Multi-Axis Vision Transformer [paper]
  • [V2X-ViT] V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [paper]
  • [MemMC-MAE] Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [paper] [code]
  • Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [paper] [code]
  • [VideoMAE] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [paper] [code]
  • PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [paper]
  • ResViT: Residual vision transformers for multi-modal medical image synthesis [paper]
  • [CrossEfficientViT] Combining EfficientNet and Vision Transformers for Video Deepfake Detection [paper] [code]
  • [Discrete ViT] Discrete Representations Strengthen Vision Transformer Robustness [paper]
  • [StyleSwin] StyleSwin: Transformer-based GAN for High-resolution Image Generation [paper] [code]
  • [SReT] Sliced Recursive Transformer [paper] [code]
  • Dynamic Token Normalization Improves Vision Transformer [paper]
  • TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [paper] [code]
  • Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [paper]
  • [ORViT] Object-Region Video Transformers [paper] [code]
  • Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [paper] [code]
  • [NViT] NViT: Vision Transformer Compression and Parameter Redistribution [paper]
  • 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [paper]
  • Adversarial Token Attacks on Vision Transformers [paper]
  • Contextual Transformer Networks for Visual Recognition [paper] [code]
  • [TranSalNet] TranSalNet: Visual saliency prediction using transformers [paper]
  • [MobileViT] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [paper]
  • A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [paper]
  • [3D-Transformer] 3D-Transformer: Molecular Representation with Transformer in 3D Space [paper]
  • [CCTrans] CCTrans: Simplifying and Improving Crowd Counting with Transformer [paper]
  • [UFO-ViT] UFO-ViT: High Performance Linear Vision Transformer without Softmax [paper]
  • Sparse Spatial Transformers for Few-Shot Learning [paper]
  • Vision Transformer Hashing for Image Retrieval [paper]
  • [OH-Former] OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [paper]
  • [Pix2seq] Pix2seq: A Language Modeling Framework for Object Detection [paper]
  • [CoAtNet] CoAtNet: Marrying Convolution and Attention for All Data Sizes [paper]
  • [LOTR] LOTR: Face Landmark Localization Using Localization Transformer [paper]
  • Transformer-Unet: Raw Image Processing with Unet [paper]
  • [GraFormer] GraFormer: Graph Convolution Transformer for 3D Pose Estimation [paper]
  • [CDTrans] CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [paper]
  • PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [paper] [code]
  • Anchor DETR: Query Design for Transformer-Based Detector [paper] [code]
  • [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]
  • [ESRT] Efficient Transformer for Single Image Super-Resolution [paper]
  • [MaskFormer] MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [paper] [code]
  • [SwinIR] SwinIR: Image Restoration Using Swin Transformer [paper] [code]
  • [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [paper]
  • Do Vision Transformers See Like Convolutional Neural Networks? [paper]
  • Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [paper]
  • Light Field Image Super-Resolution with Transformers [paper] [code]
  • Focal Self-attention for Local-Global Interactions in Vision Transformers [paper] [code]
  • Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [paper] [code]
  • Mobile-Former: Bridging MobileNet and Transformer [paper]
  • [TriTransNet] TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [paper]
  • [PSViT] PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [paper]
  • Boosting Few-shot Semantic Segmentation with Transformers [paper] [code]
  • Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [paper]
  • Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [paper]
  • [Styleformer] Styleformer: Transformer based Generative Adversarial Networks with Style Vector [paper] [code]
  • [CMT] CMT: Convolutional Neural Networks Meet Vision Transformers [paper]
  • [TransAttUnet] TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [paper]
  • TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [paper]
  • [ViTGAN] ViTGAN: Training GANs with Vision Transformers [paper]
  • What Makes for Hierarchical Vision Transformer? [paper]
  • [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [paper]
  • [FFVT] Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [paper]
  • [TransformerFusion] TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
  • Escaping the Big Data Paradigm with Compact Transformers [paper]
  • Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [paper]
  • [XCiT] XCiT: Cross-Covariance Image Transformers [paper] [code]
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [paper] [code]
  • Video Swin Transformer [paper] [code]
  • [VOLO] VOLO: Vision Outlooker for Visual Recognition [paper] [code]
  • Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [paper]
  • End-to-end Temporal Action Detection with Transformer [paper] [code]
  • How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [paper]
  • Efficient Self-supervised Vision Transformers for Representation Learning [paper]
  • Space-time Mixing Attention for Video Transformer [paper]
  • Transformed CNNs: recasting pre-trained convolutional layers with self-attention [paper]
  • [CAT] CAT: Cross Attention in Vision Transformer [paper]
  • Scaling Vision Transformers [paper]
  • [DETReg] DETReg: Unsupervised Pretraining with Region Priors for Object Detection [paper] [code]
  • Chasing Sparsity in Vision Transformers: An End-to-End Exploration [paper]
  • [MViT] MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [paper]
  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [paper]
  • On Improving Adversarial Transferability of Vision Transformers [paper]
  • Fully Transformer Networks for Semantic Image Segmentation [paper]
  • Visual Transformer for Task-aware Active Learning [paper] [code]
  • Efficient Training of Visual Transformers with Small-Size Datasets [paper]
  • Reveal of Vision Transformers Robustness against Adversarial Attacks [paper]
  • Person Re-Identification with a Locally Aware Transformer [paper]
  • [Refiner] Refiner: Refining Self-attention for Vision Transformers [paper]
  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [paper]
  • Video Instance Segmentation using Inter-Frame Communication Transformers [paper]
  • Transformer in Convolutional Neural Networks [paper] [code]
  • [Uformer] Uformer: A General U-Shaped Transformer for Image Restoration [paper] [code]
  • Patch Slimming for Efficient Vision Transformers [paper]
  • [RegionViT] RegionViT: Regional-to-Local Attention for Vision Transformers [paper]
  • Associating Objects with Transformers for Video Object Segmentation [paper] [code]
  • Few-Shot Segmentation via Cycle-Consistent Transformer [paper]
  • Glance-and-Gaze Vision Transformer [paper] [code]
  • Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [paper]
  • [DynamicViT] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper] [code]
  • When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [paper] [code]
  • Unsupervised Out-of-Domain Detection via Pre-trained Transformers [paper]
  • [TransMIL] TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
  • [TransVOS] TransVOS: Video Object Segmentation with Transformers [paper]
  • [KVT] KVT: k-NN Attention for Boosting Vision Transformers [paper]
  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [paper] [code]
  • [SegFormer] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper] [code]
  • [SDNet] SDNet: multi-branch for single image deraining using swin [paper] [code]
  • [DVT] Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [paper]
  • [GazeTR] Gaze Estimation using Transformer [paper] [code]
  • Transformer-Based Deep Image Matching for Generalizable Person Re-identification [paper]
  • Less is More: Pay Less Attention in Vision Transformers [paper]
  • [FoveaTer] FoveaTer: Foveated Transformer for Image Classification [paper]
  • [TransDA] Transformer-Based Source-Free Domain Adaptation [paper] [code]
  • An Attention Free Transformer [paper]
  • [PTNet] PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition [paper] [code]
  • [CogView] CogView: Mastering Text-to-Image Generation via Transformers [paper]
  • [NesT] Aggregating Nested Transformers [paper]
  • [TAPG] Temporal Action Proposal Generation with Transformers [paper]
  • Boosting Crowd Counting with Transformers [paper]
  • [COTR] COTR: Convolution in Transformer Network for End to End Polyp Detection [paper]
  • [TransVOD] End-to-End Video Object Detection with Spatial-Temporal Transformers [paper] [code]
  • Intriguing Properties of Vision Transformers [paper] [code]
  • Combining Transformer Generators with Convolutional Discriminators [paper]
  • Rethinking the Design Principles of Robust Vision Transformer [paper]
  • Vision Transformers are Robust Learners [paper] [code]
  • Manipulation Detection in Satellite Images Using Vision Transformer [paper]
  • [Swin-Unet] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [paper] [code]
  • Self-Supervised Learning with Swin Transformers [paper] [code]
  • [SCTN] SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [paper]
  • [RelationTrack] RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [paper]
  • [VGTR] Visual Grounding with Transformers [paper]
  • [PST] Visual Composite Set Detection Using Part-and-Sum Transformers [paper]
  • [TrTr] TrTr: Visual Tracking with Transformer [paper] [code]
  • [MOTR] MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper] [code]
  • Attention for Image Registration (AiR): an unsupervised Transformer approach [paper]
  • [TransHash] TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [paper]
  • [ISTR] ISTR: End-to-End Instance Segmentation with Transformers [paper] [code]
  • [CAT] CAT: Cross-Attention Transformer for One-Shot Object Detection [paper]
  • [CoSformer] CoSformer: Detecting Co-Salient Object with Transformers [paper]
  • End-to-End Attention-based Image Captioning [paper]
  • [PMTrans] Pyramid Medical Transformer for Medical Image Segmentation [paper]
  • [HandsFormer] HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [paper]
  • [GasHis-Transformer] GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [paper]
  • Emerging Properties in Self-Supervised Vision Transformers [paper]
  • [InTra] Inpainting Transformer for Anomaly Detection [paper]
  • [Twins] Twins: Revisiting Spatial Attention Design in Vision Transformers [paper] [code]
  • [MLMSPT] Point Cloud Learning with Transformer [paper]
  • Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [paper]
  • [ConTNet] ConTNet: Why not use convolution and transformer at the same time? [paper] [code]
  • [DTNet] Dual Transformer for Point Cloud Analysis [paper]
  • Improve Vision Transformers Training by Suppressing Over-smoothing [paper] [code]
  • Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [paper]
  • [M3DeTR] M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [paper] [code]
  • [Skeletor] Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [paper]
  • [FaceT] Learning to Cluster Faces via Transformer [paper]
  • [MViT] Multiscale Vision Transformers [paper] [code]
  • [VATT] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [paper]
  • [So-ViT] So-ViT: Mind Visual Tokens for Vision Transformer [paper] [code]
  • Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [paper] [code]
  • [TransRPPG] TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [paper]
  • [VideoGPT] VideoGPT: Video Generation using VQ-VAE and Transformers [paper]
  • [M2TR] M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [paper]
  • Transformer Transforms Salient Object Detection and Camouflaged Object Detection [paper]
  • [TransCrowd] TransCrowd: Weakly-Supervised Crowd Counting with Transformer [paper] [code]
  • Visual Transformer Pruning [paper]
  • Self-supervised Video Retrieval Transformer Network [paper]
  • Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [paper]
  • [TransGAN] TransGAN: Two Transformers Can Make One Strong GAN [paper] [code]
  • Geometry-Free View Synthesis: Transformers and no 3D Priors [paper] [code]
  • [CoaT] Co-Scale Conv-Attentional Image Transformers [paper] [code]
  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers [paper] [code]
  • [CIT] Cloth Interactive Transformer for Virtual Try-On [paper] [code]
  • Handwriting Transformers [paper]
  • [SiT] SiT: Self-supervised vIsion Transformer [paper] [code]
  • On the Robustness of Vision Transformers to Adversarial Examples [paper]
  • An Empirical Study of Training Self-Supervised Visual Transformers [paper]
  • A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [paper]
  • [AOT-GAN] Aggregated Contextual Transformations for High-Resolution Image Inpainting [paper] [code]
  • Deepfake Detection Scheme Based on Vision Transformer and Distillation [paper]
  • [ATAG] Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [paper]
  • [TubeR] TubeR: Tube-Transformer for Action Detection [paper]
  • [AAformer] AAformer: Auto-Aligned Transformer for Person Re-Identification [paper]
  • [TFill] TFill: Image Completion via a Transformer-Based Architecture [paper]
  • Group-Free 3D Object Detection via Transformers [paper] [code]
  • [STGT] Spatial-Temporal Graph Transformer for Multiple Object Tracking [paper]
  • Going deeper with Image Transformers [paper]
  • [Meta-DETR] Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [paper] [code]
  • [DA-DETR] DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [paper]
  • Robust Facial Expression Recognition with Convolutional Visual Transformers [paper]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
  • Spatiotemporal Transformer for Video-based Person Re-identification [paper]
  • [TransUNet] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [paper] [code]
  • [CvT] CvT: Introducing Convolutions to Vision Transformers [paper] [code]
  • [TFPose] TFPose: Direct Human Pose Estimation with Transformers [paper]
  • [TransCenter] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [paper]
  • Face Transformer for Recognition [paper]
  • On the Adversarial Robustness of Visual Transformers [paper]
  • Understanding Robustness of Transformers for Image Classification [paper]
  • Lifting Transformer for 3D Human Pose Estimation in Video [paper]
  • [GSA-Net] Global Self-Attention Networks for Image Recognition [paper]
  • High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
  • [DPT] Vision Transformers for Dense Prediction [paper] [code]
  • [TransFG] TransFG: A Transformer Architecture for Fine-grained Recognition [paper]
  • [TimeSformer] Is Space-Time Attention All You Need for Video Understanding? [paper]
  • Multi-view 3D Reconstruction with Transformer [paper]
  • Can Vision Transformers Learn without Natural Images? [paper] [code]
  • End-to-End Trainable Multi-Instance Pose Estimation with Transformers [paper]
  • Instance-level Image Retrieval using Reranking Transformers [paper] [code]
  • [BossNAS] BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [paper] [code]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers [paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer [paper]
  • Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [paper]
  • 3D Human Pose Estimation with Spatial and Temporal Transformers [paper] [code]
  • [UNETR] UNETR: Transformers for 3D Medical Image Segmentation [paper]
  • Scalable Visual Transformers with Hierarchical Pooling [paper]
  • [ConViT] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [paper]
  • [TransMed] TransMed: Transformers Advance Multi-modal Medical Image Classification [paper]
  • [U-Transformer] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [paper]
  • [SpecTr] SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [paper] [code]
  • [TransBTS] TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [paper] [code]
  • [SSTN] SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [paper]
  • Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [paper] [code]
  • [CPVT] Do We Really Need Explicit Position Encodings for Vision Transformers? [paper] [code]
  • Deepfake Video Detection Using Convolutional Vision Transformer [paper]
  • Training Vision Transformers for Image Retrieval [paper]
  • [VTN] Video Transformer Network [paper]
  • [BoTNet] Bottleneck Transformers for Visual Recognition [paper]
  • [CPTR] CPTR: Full Transformer Network for Image Captioning [paper]
  • Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [paper] [code]
  • [Trans2Seg] Segmenting Transparent Object in the Wild with Transformer [paper] [code]
  • Investigating the Vision Transformer Model for Image Retrieval Tasks [paper]
  • [Trear] Trear: Transformer-based RGB-D Egocentric Action Recognition [paper]
  • [VisualSparta] VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [paper]
  • [TrackFormer] TrackFormer: Multi-Object Tracking with Transformers [paper]
  • [TAPE] Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [paper]
  • [TRIQ] Transformer for Image Quality Assessment [paper] [code]
  • [TransTrack] TransTrack: Multiple-Object Tracking with Transformer [paper] [code]
  • [DeiT] Training data-efficient image transformers & distillation through attention [paper] [code]
  • [Pointformer] 3D Object Detection with Pointformer [paper]
  • [ViT-FRCNN] Toward Transformer-Based Object Detection [paper]
  • [Taming-transformers] Taming Transformers for High-Resolution Image Synthesis [paper] [code]
  • [SceneFormer] SceneFormer: Indoor Scene Generation with Transformers [paper]
  • [PCT] PCT: Point Cloud Transformer [paper]
  • [PED] DETR for Pedestrian Detection [paper]
  • [C-Tran] General Multi-label Image Classification with Transformers [paper]

2022

TPAMI

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding [paper]

ECCV

  • [X-CLIP] Expanding Language-Image Pretrained Models for General Video Recognition [paper] [code]
  • [TinyViT] TinyViT: Fast Pretraining Distillation for Small Vision Transformers [paper] [code]
  • [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
  • [AiATrack] AiATrack: Attention in Attention for Transformer Visual Tracking [paper] [code]
  • [OSTrack] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [paper] [code]
  • [Unicorn] Towards Grand Unification of Object Tracking [paper] [code]
  • [P3AFormer] Tracking Objects as Pixel-wise Distributions [paper] [code]

CVPR

  • [MAE] Masked Autoencoders Are Scalable Vision Learners [paper] [code]
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [paper] [code]
  • Fast Point Transformer [paper]
  • EDTER: Edge Detection With Transformer [paper] [code]
  • Bridged Transformer for Vision and Point Cloud 3D Object Detection [paper]
  • MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [paper]
  • HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [paper] [code]
  • Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [paper]
  • MPViT: Multi-Path Vision Transformer for Dense Prediction [paper] [code]
  • A-ViT: Adaptive Tokens for Efficient Vision Transformer [paper]
  • TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [paper] [code]
  • Continual Learning With Lifelong Vision Transformer [paper]
  • Swin Transformer V2: Scaling Up Capacity and Resolution [paper] [code]
  • Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [paper] [code]
  • Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [paper]
  • Human-Object Interaction Detection via Disentangled Transformer [paper]
  • LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [paper]
  • Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [paper]
  • Vision Transformer With Deformable Attention [paper]
  • DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [paper]
  • [Restormer] Restormer: Efficient Transformer for High-Resolution Image Restoration [paper] [code]
  • [SAM-DETR] Accelerating DETR Convergence via Semantic-Aligned Matching [paper] [code]
  • [BEVT] BEVT: BERT Pretraining of Video Transformers [paper] [code]
  • [MobileFormer] Mobile-Former: Bridging MobileNet and Transformer [paper]
  • [STRM] Spatio-temporal Relation Modeling for Few-shot Action Recognition [paper] [code]
  • [MiniViT] MiniViT: Compressing Vision Transformers with Weight Multiplexing [paper] [code]
  • [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
  • [DW-ViT] Beyond Fixation: Dynamic Window Visual Transformer [paper] [code]
  • [TokenFusion] Multimodal Token Fusion for Vision Transformers [paper]
  • [CMT] Convolutional Neural Networks Meet Vision Transformers [paper]
  • Fine-tuning Image Transformers using Learnable Memory [paper]
  • [TransMix] Attend to Mix for Vision Transformers [paper] [code]
  • [NomMer] Nominate Synergistic Context in Vision Transformer for Visual Recognition [paper] [code]
  • [SSA] Shunted Self-Attention via Multi-Scale Token Aggregation [paper] [code]
  • [RVT] Towards Robust Vision Transformer [paper] [code]
  • [LVT] Lite Vision Transformer with Enhanced Self-Attention [paper] [code]
  • [StyTr2] StyTr2: Image Style Transfer with Transformers [paper] [code]

WACV

  • Image-Adaptive Hint Generation via Vision Transformer for Outpainting [paper] [code]

ICLR

  • [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
  • [CrossFormer] CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [paper] [code]
  • Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [paper] [code]
  • [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]

2021

NeurIPS

  • ProTo: Program-Guided Transformer for Program-Guided Tasks [paper] [code]
  • [Augvit] Augmented Shortcuts for Vision Transformers [paper] [code]
  • [YOLOS] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper] [code]
  • [CATs] Semantic Correspondence with Transformers [paper] [code]
  • [Moment-DETR] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [paper] [code]
  • Dual-stream Network for Visual Recognition [paper] [code]
  • [Container] Container: Context Aggregation Network [paper] [code]
  • [TNT] Transformer in Transformer [paper] [code]
  • T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [paper]
  • Long Short-Term Transformer for Online Action Detection [paper]
  • TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
  • TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [paper]
  • TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
  • Associating Objects with Transformers for Video Object Segmentation [paper]
  • Test-Time Personalization with a Transformer for Human Pose Estimation [paper]
  • Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [paper]
  • Dynamic Grained Encoder for Vision Transformers [paper]
  • HRFormer: High-Resolution Vision Transformer for Dense Predict [paper]
  • Searching the Search Space of Vision Transformer [paper]
  • Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [paper]
  • SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper]
  • Do Vision Transformers See Like Convolutional Neural Networks? [paper]
  • Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [paper]
  • Glance-and-Gaze Vision Transformer [paper]
  • MST: Masked Self-Supervised Transformer for Visual Representation [paper]
  • DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper]
  • TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [paper]
  • Augmented Shortcuts for Vision Transformers [paper]
  • Improved Transformer for High-Resolution GANs [paper]
  • All Tokens Matter: Token Labeling for Training Better Vision Transformers [paper]
  • XCiT: Cross-Covariance Image Transformers [paper]
  • Efficient Training of Visual Transformers with Small Datasets [paper]

ICCV

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Marr Prize) [paper] [code]
  • [ICT] High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
  • [PoinTr] PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (oral) [paper] [code]
  • [STTR] Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [paper] [code]
  • [TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection [paper]
  • Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (oral) [paper] [code]
  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [paper]
  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [paper] [code]
  • [THUNDR] THUNDR: Transformer-Based 3D Human Reconstruction With Markers [paper]
  • Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper]
  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [paper] [code]
  • Spatial-Temporal Transformer for Dynamic Scene Graph Generation [paper]
  • [GLiT] GLiT: Neural Architecture Search for Global and Local Image Transformer [paper]
  • [TRAR] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [paper]
  • [UniT] UniT: Multimodal Multitask Learning With a Unified Transformer [paper] [code]
  • Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [paper]
  • Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [paper]
  • [LocalTrans] LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [paper]
  • Improving 3D Object Detection With Channel-Wise Transformer [paper]
  • A Latent Transformer for Disentangled Face Editing in Images and Videos [paper] [code]
  • [GroupFormer] GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [paper]
  • Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [paper]
  • [WB-DETR] WB-DETR: Transformer-Based Detector Without Backbone [paper]
  • The Animation Transformer: Visual Correspondence via Segment Matching [paper]
  • Relaxed Transformer Decoders for Direct Action Proposal Generation [paper]
  • [PPT-Net] Pyramid Point Cloud Transformer for Large-Scale Place Recognition [paper] [code]
  • Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [paper]
  • Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [paper]
  • Image Harmonization With Transformer [paper] [code]
  • [COTR] COTR: Correspondence Transformer for Matching Across Images [paper]
  • [MUSIQ] MUSIQ: Multi-Scale Image Quality Transformer [paper]
  • Episodic Transformer for Vision-and-Language Navigation [paper]
  • Action-Conditioned 3D Human Motion Synthesis With Transformer VAE [paper]
  • [CrackFormer] CrackFormer: Transformer Network for Fine-Grained Crack Detection [paper]
  • [HiT] HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [paper]
  • Event-Based Video Reconstruction Using Transformer [paper]
  • [STVGBert] STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [paper]
  • [HiFT] HiFT: Hierarchical Feature Transformer for Aerial Tracking [paper] [code]
  • [DocFormer] DocFormer: End-to-End Transformer for Document Understanding [paper]
  • [LeViT] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [paper] [code]
  • [SignBERT] SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [paper]
  • [VidTr] VidTr: Video Transformer Without Convolutions [paper]
  • [ACTOR] Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [paper]
  • [Segmenter] Segmenter: Transformer for Semantic Segmentation [paper] [code]
  • [Visformer] Visformer: The Vision-friendly Transformer [paper] [code]
  • [PnP-DETR] PnP-DETR: Towards Efficient Visual Analysis with Transformers (ICCV) [paper] [code]
  • [VoTr] Voxel Transformer for 3D Object Detection [paper]
  • [TransVG] TransVG: End-to-End Visual Grounding with Transformers [paper]
  • [3DETR] An End-to-End Transformer Model for 3D Object Detection [paper] [code]
  • [Eformer] Eformer: Edge Enhancement based Transformer for Medical Image Denoising [paper]
  • [TransFER] TransFER: Learning Relation-aware Facial Expression Representations with Transformers [paper]
  • [Oriented RCNN] Oriented Object Detection with Transformer [paper]
  • [ViViT] ViViT: A Video Vision Transformer [paper]
  • [Stark] Learning Spatio-Temporal Transformer for Visual Tracking [paper] [code]
  • [CT3D] Improving 3D Object Detection with Channel-wise Transformer [paper]
  • [VST] Visual Saliency Transformer [paper]
  • [PiT] Rethinking Spatial Dimensions of Vision Transformers [paper] [code]
  • [CrossViT] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [paper] [code]
  • [PointTransformer] Point Transformer [paper]
  • [TS-CAM] TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [paper] [code]
  • [VTs] Visual Transformers: Token-based Image Representation and Processing for Computer Vision [paper]
  • [TransDepth] Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [paper] [code]
  • [Conditional DETR] Conditional DETR for Fast Training Convergence [paper] [code]
  • [PIT] PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [paper] [code]
  • [SOTR] SOTR: Segmenting Objects with Transformers [paper] [code]
  • [SnowflakeNet] SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [paper] [code]
  • [TransPose] TransPose: Keypoint Localization via Transformer [paper] [code]
  • [TransReID] TransReID: Transformer-based Object Re-Identification [paper] [code]
  • [CWT] Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [paper] [code]
  • Anticipative Video Transformer [paper] [code]
  • Rethinking and Improving Relative Position Encoding for Vision Transformer [paper] [code]
  • Vision Transformer with Progressive Sampling [paper] [code]
  • [SMCA] Fast Convergence of DETR with Spatially Modulated Co-Attention [paper] [code]
  • [AutoFormer] AutoFormer: Searching Transformers for Visual Recognition [paper] [code]

CVPR

  • Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [paper]
  • [HOTR] HOTR: End-to-End Human-Object Interaction Detection with Transformers (oral) [paper]
  • [METRO] End-to-End Human Pose and Mesh Reconstruction with Transformers [paper]
  • [LETR] Line Segment Detection Using Transformers without Edges [paper]
  • [TransFuser] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [paper] [code]
  • Pose Recognition with Cascade Transformers [paper]
  • Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [paper]
  • [LoFTR] LoFTR: Detector-Free Local Feature Matching with Transformers [paper] [code]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
  • [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [paper] [code]
  • [TransT] Transformer Tracking [paper] [code]
  • Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (oral) [paper]
  • [VisTR] End-to-End Video Instance Segmentation with Transformers [paper]
  • Transformer Interpretability Beyond Attention Visualization [paper] [code]
  • [IPT] Pre-Trained Image Processing Transformer [paper]
  • [UP-DETR] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper]
  • [IQT] Perceptual Image Quality Assessment with Transformers (workshop) [paper]
  • High-Resolution Complex Scene Synthesis with Transformers (workshop) [paper]

ICML

  • Generative Video Transformer: Can Objects be the Words? [paper]
  • [GANsformer] Generative Adversarial Transformers [paper] [code]

ICRA

  • [NDT-Transformer] NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [paper]

ICLR

  • [VTNet] VTNet: Visual Transformer Network for Object Goal Navigation [paper]
  • [Vision Transformer] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [paper] [code]
  • [Deformable DETR] Deformable DETR: Deformable Transformers for End-to-End Object Detection [paper] [code]
  • [LambdaNetworks] LambdaNetworks: Modeling Long-Range Interactions Without Attention [paper] [code]

ACM MM

  • Video Transformer for Deepfake Detection with Incremental Learning [paper]
  • [HAT] HAT: Hierarchical Aggregation Transformers for Person Re-identification [paper]
  • Token Shift Transformer for Video Classification [paper] [code]
  • [DPT] DPT: Deformable Patch-based Transformer for Visual Recognition [paper] [code]

MICCAI

  • [UTNet] UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [paper] [code]
  • [MedT] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [paper] [code]
  • [MCTrans] Multi-Compound Transformer for Accurate Biomedical Image Segmentation [paper] [code]
  • [PNS-Net] Progressively Normalized Self-Attention Network for Video Polyp Segmentation [paper] [code]
  • [MBT-Net] A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [paper]

BMVC

  • [ACT] End-to-End Object Detection with Adaptive Clustering Transformer [paper]
  • [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
  • [TransFusion] TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [paper] [code]

ISIE

  • VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (ISIE) [paper]

CORL

  • [DETR3D] DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [paper]

IJCAI

  • Medical Image Segmentation using Squeeze-and-Expansion Transformers [paper]

IROS

  • [YOGO] You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module (IROS) [paper] [code]
  • [PTT] PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [paper] [code]

WACV

  • [LSTR] End-to-end Lane Shape Prediction with Transformers [paper] [code]

ICDAR

  • Vision Transformer for Fast and Efficient Scene Text Recognition [paper]

2020

  • [DETR] End-to-End Object Detection with Transformers (ECCV) [paper] [code]
  • [FPT] Feature Pyramid Transformer (CVPR) [paper] [code]

Other resource

Acknowledgement

Thanks to Awesome-Crowd-Counting for the template.