
Mask DINO


Feng Li*, Hao Zhang*, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum.

This repository is the official implementation of Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation (DINO is pronounced `daɪnoʊ`, as in dinosaur). Our code is based on detectron2. A detrex version is open-sourced simultaneously.

🔥 We release a strong open-set object detection and segmentation model, OpenSeeD, based on MaskDINO, which achieves the best results on open-set object segmentation tasks. Code and checkpoints are available here.

News

[2023/7] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoints are available!

[2023/2] Mask DINO has been accepted to CVPR 2023!

[2022/9] We release a toolbox, detrex, that provides state-of-the-art Transformer-based detection algorithms. It includes DINO with better performance, and Mask DINO will also be released with a detrex implementation. You are welcome to use it!

[2022/7] Code for DINO is available here!

[2022/3] We built the repo Awesome Detection Transformer to collect papers about Transformers for detection and segmentation. We welcome your attention!

Features

  • A unified architecture for object detection and panoptic, instance, and semantic segmentation.
  • Task and data cooperation between detection and segmentation.
  • State-of-the-art performance under the same setting.
  • Support for major detection and segmentation datasets: COCO, ADE20K, Cityscapes.

Code Updates

  • [2022/12/02] Our code and checkpoints are available! Mask DINO achieves 51.7 and 59.0 box AP on COCO with ResNet-50 and Swin-L backbones respectively, without extra detection data, outperforming DINO under the same setting!

  • [2022/6] We propose a unified detection and segmentation model, Mask DINO, that achieves the best results on all three segmentation tasks (54.7 AP on the COCO instance leaderboard, 59.5 PQ on the COCO panoptic leaderboard, and 60.8 mIoU on the ADE20K semantic leaderboard)!

Todo list
  • Release code and checkpoints

  • Release model conversion checkpointer from DINO to MaskDINO

  • Release GPU cluster submit scripts based on submitit for multi-node training

  • Release EMA training for large models

  • Release more large models


Installation

See installation instructions.

Getting Started

See Inference Demo with Pre-trained Model.

See Results.

See Preparing Datasets for MaskDINO.

See Getting Started.

See More Usage.


Results

In this part, we present the clean models that do not use extra detection data or tricks.

COCO Instance Segmentation and Object Detection

We follow DINO in using a hidden dimension of 2048 in the feedforward layers of the encoder by default. We also use the mask-enhanced box initialization proposed in our paper for instance segmentation and detection. To present our model more fully, we also list models trained with hidden dimension 1024 (hid 1024) and without mask-enhanced initialization (no mask enhance) in this table.

| Name | Backbone | Epochs | Mask AP | Box AP | Params | GFlops | download |
|------|----------|--------|---------|--------|--------|--------|----------|
| MaskDINO (hid 1024) [config] | R50 | 50 | 46.1 | 51.5 | 47M | 226 | model |
| MaskDINO [config] | R50 | 50 | 46.3 | 51.7 | 52M | 286 | model |
| MaskDINO (no mask enhance) [config] | Swin-L (IN21k) | 50 | 52.1 | 58.3 | 223M | 1326 | model |
| MaskDINO [config] | Swin-L (IN21k) | 50 | 52.3 | 59.0 | 223M | 1326 | model |
| MaskDINO + O365 data + 1.2× larger image | Swin-L (IN21k) | 20 | 54.5 | --- | 223M | 1326 | To Release |

COCO Panoptic Segmentation

| Name | Backbone | Epochs | PQ | Mask AP | Box AP | mIoU | download |
|------|----------|--------|----|---------|--------|------|----------|
| MaskDINO [config] | R50 | 50 | 53.0 | 48.8 | 44.3 | 60.6 | model |
| MaskDINO [config] | Swin-L (IN21k) | 50 | 58.3 | 50.6 | 56.2 | 67.5 | model |
| MaskDINO + O365 data + 1.2× larger image | Swin-L (IN21k) | 20 | 59.4 | 53.0 | 57.7 | 67.3 | To Release |

Semantic Segmentation

We use hidden dimension 1024 and 100 queries for semantic segmentation.

| Name | Dataset | Backbone | Iterations | mIoU | download |
|------|---------|----------|------------|------|----------|
| MaskDINO [config] | ADE20K | R50 | 160k | 48.7 | model |
| MaskDINO [config] | Cityscapes | R50 | 90k | 79.8 | model |

You can also find all these models here.

All models were trained with 4 NVIDIA A100 GPUs (ResNet-50 based models) or 8 NVIDIA A100 GPUs (Swin-L based models).

We will release more pretrained models in the future.

Getting Started

In the above tables, the "Name" column contains a config link to the corresponding config file, and the model checkpoint can be downloaded from the model link.

If your dataset files are not under this repo, run export DETECTRON2_DATASETS=/path/to/your/data or use a symbolic link (ln -s) to link the dataset into this repo before running the following commands.

Evaluate our pretrained models

  • You can download our pretrained models and evaluate them with the following command:
    python train_net.py --eval-only --num-gpus 8 --config-file config_path MODEL.WEIGHTS /path/to/checkpoint_file
    For example, to reproduce our instance segmentation result, copy the config path from the table, download the pretrained checkpoint to /path/to/checkpoint_file, and run
    python train_net.py --eval-only --num-gpus 8 --config-file configs/coco/instance-segmentation/maskdino_R50_bs16_50ep_3s_dowsample1_2048.yaml MODEL.WEIGHTS /path/to/checkpoint_file
    which should reproduce the model's reported results.

Train MaskDINO to reproduce results

  • Running the above command without --eval-only will train the model. For Swin backbones, you need to specify the path to the pretrained backbone with MODEL.WEIGHTS /path/to/pretrained_checkpoint
    python train_net.py --num-gpus 8 --config-file config_path MODEL.WEIGHTS /path/to/checkpoint_file
  • For ResNet-50 models, training on 8 GPUs requires around 15 GB of memory per GPU and about 3 days for 50 epochs.
  • For Swin-L models, training on 8 GPUs requires around 60 GB of memory per GPU. If your GPUs do not have enough memory, you can also train with 16 GPUs using distributed training on two nodes.
  • We use a total batch size of 16 for all our models. If you train on 1 GPU, you need to adjust the learning rate and batch size yourself (see the sketch after this list)
    python train_net.py --num-gpus 1 --config-file config_path SOLVER.IMS_PER_BATCH SET_TO_SOME_REASONABLE_VALUE SOLVER.BASE_LR SET_TO_SOME_REASONABLE_VALUE
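A common heuristic for picking SOLVER.BASE_LR at a smaller total batch size is to scale it linearly with the batch size. This rule is an assumption on our part, not something the configs guarantee; a minimal sketch:

    # Hypothetical helper: scale the learning rate linearly with the total batch
    # size. The reference setup is SOLVER.IMS_PER_BATCH = 16 (assumed baseline);
    # linear scaling is a common convention, not a guarantee of this repo.
    def scaled_lr(base_lr: float, new_batch: int, base_batch: int = 16) -> float:
        """Return the learning rate rescaled for a new total batch size."""
        return base_lr * new_batch / base_batch

    # e.g. a config tuned for batch 16 at base LR 1e-4, retargeted to batch 4:
    print(scaled_lr(1e-4, 4))  # 2.5e-05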

You can also refer to Getting Started with Detectron2 for full usage.

More Usage

Mask-enhanced box initialization

We provide two ways to convert predicted masks to boxes to initialize the decoder boxes. You can set it as follows:

  • MODEL.MaskDINO.INITIALIZE_BOX_TYPE: no — do not use mask-enhanced box initialization
  • MODEL.MaskDINO.INITIALIZE_BOX_TYPE: mask2box — a fast conversion
  • MODEL.MaskDINO.INITIALIZE_BOX_TYPE: bitmask — the conversion provided by detectron2; slower but more accurate

The two conversion methods do not affect the final performance much; you can choose either. A minimal sketch of the fast conversion is shown below.
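For intuition, a fast mask-to-box conversion in the spirit of mask2box simply takes the min/max coordinates of the foreground pixels. The sketch below is an illustrative approximation, not this repo's exact implementation:

    import torch

    def masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:
        """Convert binary masks (N, H, W) into xyxy boxes (N, 4)."""
        boxes = torch.zeros(masks.shape[0], 4, dtype=torch.float32)
        for i, mask in enumerate(masks):
            ys, xs = torch.where(mask > 0)  # coordinates of foreground pixels
            if ys.numel() == 0:
                continue  # empty mask: leave a degenerate box at the origin
            boxes[i] = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()
        return boxes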

In addition, if you have already trained a model for 50 epochs without mask-enhanced box initialization, you can plug in this method and simply finetune the model for the last few epochs (i.e., load the model trained for 32K iterations and finetune it). This achieves performance similar to training from scratch while being more flexible.
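Such a finetuning run might look like the following (a hypothetical command assembled from the training command and config key above; the checkpoint path is a placeholder):

    python train_net.py --num-gpus 8 --config-file config_path MODEL.WEIGHTS /path/to/32k_iter_checkpoint MODEL.MaskDINO.INITIALIZE_BOX_TYPE mask2box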

Model components

MaskDINO consists of three components: a backbone, a pixel decoder and a Transformer decoder. You can easily replace each of these three components with your own implementation.

  • backbone: define and register your backbone under maskdino/modeling/backbone. You can follow the Swin Transformer as an example (a minimal registration sketch follows this list).

  • pixel decoder: the pixel decoder is actually the multi-scale encoder in DINO and Deformable DETR; we follow Mask2Former in calling it the pixel decoder. It lives in maskdino/modeling/pixel_decoder, where you can swap in your own multi-scale encoder. The returned values include:

    1. mask_features: the per-pixel embeddings at 1/4 of the original image resolution, obtained by fusing the backbone's 1/4 features with the multi-scale encoder's encoded 1/8 features. These are used to produce the binary masks.
    2. multi_scale_features: the multi-scale inputs to the Transformer decoder. For ResNet-50 models with 4 scales, we use resolutions 1/32, 1/16, and 1/8 (arbitrary resolutions work here) and, following DINO, additionally downsample the 1/32 features to obtain a 4th scale at 1/64 resolution. For 5-scale models with Swin-L, we additionally use 1/4-resolution features, as in DINO.
  • transformer decoder: it mainly follows the DINO decoder to perform the detection and segmentation tasks. It is defined in maskdino/modeling/transformer_decoder.
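For the backbone, the generic detectron2 registration pattern looks like the sketch below (the class name and layer choices are hypothetical placeholders, not MaskDINO's code; a real backbone registered under maskdino/modeling/backbone follows the same pattern):

    import torch.nn as nn
    from detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec

    @BACKBONE_REGISTRY.register()
    class ToyBackbone(Backbone):
        """A placeholder backbone illustrating the detectron2 registration API."""

        def __init__(self, cfg, input_shape):
            super().__init__()
            # A single strided conv standing in for a real feature extractor.
            self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=16, padding=3)

        def forward(self, image):
            # Return a dict mapping feature names to feature maps.
            return {"conv1": self.conv1(image)}

        def output_shape(self):
            # Declare channels and strides so downstream modules can adapt.
            return {"conv1": ShapeSpec(channels=64, stride=16)}

The registered backbone can then be selected from a config via MODEL.BACKBONE.NAME.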

LICENSE

Mask DINO is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Copyright (c) IDEA. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Citing Mask DINO

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{li2022mask,
      title={Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation}, 
      author={Feng Li and Hao Zhang and Huaizhe Xu and Shilong Liu and Lei Zhang and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2206.02777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

If you find the code useful, please also consider the following BibTeX entry.

@misc{zhang2022dino,
      title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection}, 
      author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2203.03605},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{li2022dn,
      title={{DN-DETR}: Accelerate {DETR} Training by Introducing Query Denoising},
      author={Li, Feng and Zhang, Hao and Liu, Shilong and Guo, Jian and Ni, Lionel M and Zhang, Lei},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={13619--13627},
      year={2022}
}

@inproceedings{
      liu2022dabdetr,
      title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
      author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
      booktitle={International Conference on Learning Representations},
      year={2022},
      url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}

Acknowledgement

Many thanks to these excellent open-source projects!
