Transformer-Based Visual Segmentation: A Survey
arXiv, 2023
Xiangtai Li · Henghui Ding · Wenwei Zhang · Haobo Yuan · Guangliang Cheng · Jiangmiao Pang · Kai Chen · Ziwei Liu · Chen Change Loy
This repo records, tracks, and benchmarks recent transformer-based visual segmentation methods,
as a supplement to our survey.
If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free
to open a pull request.
We will add the missing papers to this repo ASAP.
🔥News
[-] The second draft is on arXiv.
🔥Highlight!!
[1] Previous transformer surveys group methods by task and setting. In contrast, we revisit and group existing transformer-based methods from a technical perspective.
[2] We survey the methods in two parts: one covers the mainstream tasks built on a DETR-like meta-architecture, and the other covers related directions organized by task.
[3] We re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.
[4] We also include query-based detection transformers, since object queries unify both segmentation and detection tasks.
Introduction
This survey is the first detailed review of transformer-based segmentation.
Summary of Contents
Methods: A Survey
Meta-Architecture
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2020 | ECCV | DETR | End-to-End Object Detection with Transformers | Code |
2021 | ICLR | Deformable DETR | Deformable DETR: Deformable Transformers for End-to-End Object Detection | Code |
2021 | CVPR | MaX-DeepLab | MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | Code |
2021 | NeurIPS | MaskFormer | MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation | Code |
2021 | NeurIPS | K-Net | K-Net: Towards Unified Image Segmentation | Code |
2023 | CVPR | Lite-DETR | Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR | Code |
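
To illustrate how object queries can drive segmentation in a DETR-like meta-architecture, here is a minimal Python sketch of mask-classification inference in the spirit of the MaskFormer-style methods listed above. It is a toy illustration under stated assumptions, not code from any of the listed repositories: the function name `query_mask_inference`, the input layout, and the hand-written numbers are all hypothetical, and a real model would produce the query embeddings and pixel features with a backbone and a transformer decoder.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def query_mask_inference(query_embeds, class_logits, pixel_feats):
    """Toy mask-classification inference (hypothetical helper).

    query_embeds: N x C list, one embedding per object query.
    class_logits: N x (K + 1) list; the last entry is assumed to be "no object".
    pixel_feats:  P x C list, one feature vector per pixel.
    Returns a list of P class ids in [0, K).
    """
    # Each query predicts a soft mask via a dot product with every pixel feature.
    mask_probs = [
        [sigmoid(sum(q * f for q, f in zip(query, pixel))) for pixel in pixel_feats]
        for query in query_embeds
    ]
    # Each query also predicts a class distribution; drop the "no object" entry.
    class_probs = [softmax(logits)[:-1] for logits in class_logits]
    num_classes = len(class_probs[0])

    labels = []
    for p in range(len(pixel_feats)):
        # Per-class score: sum over queries of class probability times mask probability.
        scores = [
            sum(class_probs[n][k] * mask_probs[n][p] for n in range(len(query_embeds)))
            for k in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two queries, two classes (plus "no object"), three pixels.
queries = [[5.0, -5.0], [-5.0, 5.0]]
logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(query_mask_inference(queries, logits, pixels))  # [0, 1, 0]
```

Each pixel takes the class of the query whose mask covers it. In a detection transformer the same queries would feed a box head rather than a mask dot product, which is one way to read the claim that object queries unify the two tasks.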
Strong Representation
Better ViTs Design
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | CVPR | SETR | Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | Code |
2021 | ICCV | MViT-V1 | Multiscale Vision Transformers | Code |
2022 | CVPR | MViT-V2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | Code |
2021 | NeurIPS | XCiT | XCiT: Cross-Covariance Image Transformers | Code |
2021 | ICCV | Pyramid ViT | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | Code |
2021 | ICCV | CrossViT | CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | Code |
2021 | ICCV | CoaT | Co-Scale Conv-Attentional Image Transformers | Code |
2022 | CVPR | MPViT | MPViT: Multi-Path Vision Transformer for Dense Prediction | Code |
2022 | NeurIPS | SegViT | SegViT: Semantic Segmentation with Plain Vision Transformers | Code |
2022 | arXiv | RSSeg | Representation Separation for Semantic Segmentation with Vision Transformers | N/A |
Hybrid CNNs/Transformers/MLPs
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | ICCV | Swin | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Code |
2022 | CVPR | Swin-v2 | Swin Transformer V2: Scaling Up Capacity and Resolution | Code |
2021 | NeurIPS | SegFormer | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | Code |
2022 | CVPR | CMT | CMT: Convolutional Neural Networks Meet Vision Transformers | Code |
2021 | NeurIPS | Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | Code |
2021 | ICCV | CvT | CvT: Introducing Convolutions to Vision Transformers | Code |
2021 | NeurIPS | ViTAE | ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | Code |
2022 | CVPR | ConvNeXt | A ConvNet for the 2020s | Code |
2022 | NeurIPS | SegNeXt | SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation | Code |
2022 | CVPR | PoolFormer | PoolFormer: MetaFormer Is Actually What You Need for Vision | Code |
2022 | arXiv | STM | Demystify Transformers & Convolutions in Modern Image Deep Networks | Code |
Self-Supervised Learning
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | ICCV | MoCo v3 | An Empirical Study of Training Self-Supervised Vision Transformers | Code |
2022 | ICLR | BEiT | BEiT: BERT Pre-Training of Image Transformers | Code |
2022 | CVPR | MaskFeat | Masked Feature Prediction for Self-Supervised Visual Pre-Training | Code |
2022 | CVPR | MAE | Masked Autoencoders Are Scalable Vision Learners | Code |
2022 | NeurIPS | ConvMAE | MCMAE: Masked Convolution Meets Masked Autoencoders | Code |
2023 | ICLR | SparK | SparK: the first successful BERT/MAE-style pretraining on any convolutional networks | Code |
2022 | CVPR | FLIP | Scaling Language-Image Pre-training via Masking | Code |
2023 | arXiv | ConvNeXt V2 | ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | Code |
Interaction Design in Decoder
Improved Cross Attention Design
Spatial-Temporal Cross Attention Design
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | CVPR | VisTR | VisTR: End-to-End Video Instance Segmentation with Transformers | Code |
2021 | NeurIPS | IFC | Video Instance Segmentation using Inter-Frame Communication Transformers | Code |
2022 | CVPR | SlotVPS | Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation | N/A |
2022 | CVPR | TubeFormer-DeepLab | TubeFormer-DeepLab: Video Mask Transformer | N/A |
2022 | CVPR | Video K-Net | Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation | Code |
2022 | CVPR | TeViT | Temporally Efficient Vision Transformer for Video Instance Segmentation | Code |
2022 | ECCV | SeqFormer | SeqFormer: Sequential Transformer for Video Instance Segmentation | Code |
2022 | arXiv | Mask2Former-VIS | Mask2Former for Video Instance Segmentation | Code |
2022 | PAMI | TransVOD | TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers | Code |
2022 | NeurIPS | VITA | VITA: Video Instance Segmentation via Object Token Association | Code |
Optimizing Object Query
Adding Position Information into Query
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | ICCV | Conditional-DETR | Conditional DETR for Fast Training Convergence | Code |
2022 | arXiv | Conditional-DETR-v2 | Conditional DETR V2: Efficient Detection Transformer with Box Queries | Code |
2022 | AAAI | Anchor DETR | Anchor DETR: Query Design for Transformer-Based Detector | Code |
2022 | ICLR | DAB-DETR | DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | Code |
2021 | arXiv | Efficient DETR | Efficient DETR: Improving End-to-End Object Detector with Dense Prior | N/A |
Adding Extra Supervision into Query
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2022 | ECCV | DE-DETR | Towards Data-Efficient Detection Transformers | Code |
2022 | CVPR | DN-DETR | DN-DETR: Accelerate DETR Training by Introducing Query Denoising | Code |
2023 | ICLR | DINO | DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | Code |
2023 | CVPR | MP-Former | MP-Former: Mask-Piloted Transformer for Image Segmentation | Code |
2023 | CVPR | Mask-DINO | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | Code |
2022 | NeurIPS | N/A | Learning Equivariant Segmentation with Instance-Unique Querying | Code |
2023 | CVPR | H-DETR | DETRs with Hybrid Matching | Code |
2023 | ICCV | Group-DETR | Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment | N/A |
2023 | ICCV | Co-DETR | DETRs with Collaborative Hybrid Assignments Training | Code |
Using Query For Association
Query as Instance Association
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2022 | CVPR | TrackFormer | TrackFormer: Multi-Object Tracking with Transformer | Code |
2021 | arxiv | TransTrack | TransTrack: Multiple Object Tracking with Transformer | Code |
2022 | ECCV | MOTR | MOTR: End-to-End Multiple-Object Tracking with TRansformer | Code |
2022 | NeurIPS | MinVIS | MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training | Code |
2022 | ECCV | IDOL | In Defense of Online Models for Video Instance Segmentation | Code |
2022 | CVPR | Video K-Net | Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation | Code |
2023 | CVPR | GenVIS | A Generalized Framework for Video Instance Segmentation | Code |
2023 | ICCV | Tube-Link | Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation | Code |
2023 | ICCV | CTVIS | CTVIS: Consistent Training for Online Video Instance Segmentation | Code |
2023 | CVPR-W | Video-kMaX | Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation | N/A |
Query as Linking Multi-Tasks
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2022 | ECCV | Panoptic-PartFormer | Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation | Code |
2022 | ECCV | PolyphonicFormer | PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation | Code |
2022 | CVPR | PanopticDepth | PanopticDepth: A Unified Framework for Depth-Aware Panoptic Segmentation | Code |
2022 | ECCV | Fashionformer | Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition | Code |
2022 | ECCV | InvPT | InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | Code |
2023 | CVPR | UNINEXT | Universal Instance Perception as Object Discovery and Retrieval | Code |
Conditional Query Generation
Conditional Query Fusion on Language Features
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | ICCV | VLT | Vision-Language Transformer and Query Generation for Referring Segmentation | Code |
2022 | CVPR | LAVT | LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Code |
2022 | CVPR | ReSTR | ReSTR: Convolution-Free Referring Image Segmentation Using Transformers | N/A |
2022 | CVPR | CRIS | CRIS: CLIP-Driven Referring Image Segmentation | Code |
2022 | CVPR | MTTR | End-to-End Referring Video Object Segmentation with Multimodal Transformers | Code |
2022 | CVPR | LBDT | Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation | Code |
2022 | CVPR | ReferFormer | Language as Queries for Referring Video Object Segmentation | Code |
Conditional Query Fusion on Cross Image Features
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | NeurIPS | CyCTR | Few-Shot Segmentation via Cycle-Consistent Transformer | Code |
2022 | CVPR | MatteFormer | MatteFormer: Transformer-Based Image Matting via Prior-Tokens | Code |
2022 | ECCV | SegDeformer | A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining | Code |
2022 | arXiv | StructToken | StructToken: Rethinking Semantic Segmentation with Structural Prior | N/A |
2022 | NeurIPS | MM-Former | Mask Matching Transformer for Few-Shot Segmentation | Code |
2022 | ECCV | AAFormer | Adaptive Agent Transformer for Few-shot Segmentation | N/A |
2023 | arXiv | ReferenceTwice | Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation | Code |
Tuning Foundation Models
Vision Adapter
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2022 | CVPR | CoCoOp | Conditional Prompt Learning for Vision-Language Models | Code |
2022 | ECCV | Tip-Adapter | Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification | Code |
2022 | ECCV | EVL | Frozen CLIP Models are Efficient Video Learners | Code |
2023 | ICLR | ViT-Adapter | Vision Transformer Adapter for Dense Predictions | Code |
2022 | CVPR | DenseCLIP | DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | Code |
2022 | CVPR | CLIPSeg | Image Segmentation Using Text and Image Prompts | Code |
2023 | CVPR | OneFormer | OneFormer: One Transformer to Rule Universal Image Segmentation | Code |
Open Vocabulary Learning
Related Domains and Beyond
Point Cloud Segmentation
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2021 | ICCV | Point Transformer | Point Transformer | N/A |
2021 | CVM | PCT | PCT: Point Cloud Transformer | Code |
2022 | CVPR | Stratified Transformer | Stratified Transformer for 3D Point Cloud Segmentation | Code |
2022 | CVPR | Point-BERT | Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling | Code |
2022 | ECCV | Point-MAE | Masked Autoencoders for Point Cloud Self-supervised Learning | Code |
2022 | NeurIPS | Point-M2AE | Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training | Code |
2022 | ICRA | Mask3D | Mask3D for 3D Semantic Instance Segmentation | Code |
2023 | AAAI | SPFormer | Superpoint Transformer for 3D Scene Instance Segmentation | Code |
2023 | AAAI | PUPS | PUPS: Point Cloud Unified Panoptic Segmentation | N/A |
Domain-aware Segmentation
Label and Model Efficient Segmentation
Class Agnostic Segmentation and Tracking
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2022 | CVPR | Transfiner | Mask Transfiner for High-Quality Instance Segmentation | Code |
2022 | ECCV | VMT | Video Mask Transfiner for High-Quality Video Instance Segmentation | Code |
2022 | arXiv | SimpleClick | SimpleClick: Interactive Image Segmentation with Simple Vision Transformers | Code |
2023 | ICLR | PatchDCT | PatchDCT: Patch Refinement for High Quality Instance Segmentation | Code |
2019 | ICCV | STM | Video Object Segmentation using Space-Time Memory Networks | Code |
2021 | NeurIPS | AOT | Associating Objects with Transformers for Video Object Segmentation | Code |
2021 | NeurIPS | STCN | Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation | Code |
2022 | ECCV | XMem | XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model | Code |
2022 | CVPR | PCVOS | Per-Clip Video Object Segmentation | Code |
2023 | CVPR | N/A | Look Before You Match: Instance Understanding Matters in Video Object Segmentation | N/A |
Medical Image Segmentation
Year | Venue | Acronym | Paper Title | Code/Project |
---|---|---|---|---|
2020 | BIBM | CellDETR | Attention-Based Transformers for Instance Segmentation of Cells in Microstructures | Code |
2021 | arXiv | TransUNet | TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation | Code |
2022 | ECCV Workshop | Swin-Unet | Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation | Code |
2021 | MICCAI | TransFuse | TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation | Code |
2022 | WACV | UNETR | UNETR: Transformers for 3D Medical Image Segmentation | Code |
Acknowledgement
If you find our survey and repository useful for your research project, please consider citing our paper:
@article{li2023transformer,
author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
title={Transformer-Based Visual Segmentation: A Survey},
journal={arXiv pre-print},
year={2023}
}
Contact
Related Repo For Segmentation and Detection
Attention Model Repo by Min-Hung (Steve) Chen.
Detection Transformer Repo by IDEA.
Open Vocabulary Learning Repo by PKU and NTU.