MaskFreeVIS

Mask-Free Video Instance Segmentation [CVPR 2023].

This is the official PyTorch implementation of MaskFreeVIS, built on the open-source Detectron2. We aim to remove the necessity for expensive video masks, and even image masks, when training VIS models. Our project website contains more information, including visual video comparisons: vis.xyz/pub/maskfreevis.

Mask-Free Video Instance Segmentation
Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
CVPR 2023

Highlights

  • High-performing video instance segmentation without using any video mask or even image mask labels. Using a Swin-L backbone and built on Mask2Former, MaskFreeVIS achieves 56.0 AP on YTVIS without using any video mask labels. Using ResNet-101, MaskFreeVIS achieves 49.1 AP without using video masks, and 47.3 AP when only using a COCO-mask-initialized model.
  • Novelty: a new parameter-free Temporal KNN-patch Loss (TK-Loss) that leverages temporal mask consistency via unsupervised one-to-k patch correspondences.
  • Simple: the TK-Loss integrates easily with state-of-the-art transformer-based VIS models and introduces no trainable parameters.

Visualization results of MaskFreeVIS

Introduction

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance.
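
To make the TK-Loss concrete, the following is a minimal, self-contained PyTorch sketch of a temporal KNN-patch consistency term. It is an illustration under simplifying assumptions (L2 patch distance, torch.roll shifts that wrap at image borders, and an L1 consistency surrogate); the names tk_patch_loss, patch_features, radius, k and max_dist are ours for illustration, not the repository's API.

import torch
import torch.nn.functional as F

def patch_features(img, patch_size=3):
    # (B, C, H, W) -> (B, C*p*p, H, W): the flattened local patch around each pixel.
    pad = patch_size // 2
    feats = F.unfold(img, kernel_size=patch_size, padding=pad)  # (B, C*p*p, H*W)
    return feats.view(img.shape[0], -1, img.shape[-2], img.shape[-1])

def tk_patch_loss(frame_a, frame_b, mask_a, mask_b,
                  patch_size=3, radius=2, k=5, max_dist=0.5):
    # frame_*: (B, C, H, W) images; mask_*: (B, 1, H, W) predicted mask probabilities.
    fa = patch_features(frame_a, patch_size)
    fb = patch_features(frame_b, patch_size)

    dists, cands = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Shift frame-b tensors so position (y, x) holds candidate (y+dy, x+dx).
            fb_s = torch.roll(fb, shifts=(-dy, -dx), dims=(-2, -1))
            mb_s = torch.roll(mask_b, shifts=(-dy, -dx), dims=(-2, -1))
            dists.append(((fa - fb_s) ** 2).mean(dim=1))  # (B, H, W) patch distance
            cands.append(mb_s[:, 0])                      # (B, H, W) candidate mask value
    dists = torch.stack(dists, dim=1)  # (B, O, H, W), O = (2*radius + 1) ** 2
    cands = torch.stack(cands, dim=1)

    # One-to-k matching: keep the K visually closest candidates per pixel,
    # discarding matches whose patch distance exceeds the threshold.
    knn_dist, knn_idx = dists.topk(k, dim=1, largest=False)
    knn_mask = cands.gather(1, knn_idx)  # (B, K, H, W)
    valid = (knn_dist < max_dist).float()

    # Enforce mask agreement on the surviving matches (a simple L1 surrogate).
    loss = ((mask_a - knn_mask).abs() * valid).sum()
    return loss / valid.sum().clamp(min=1.0)

In the full method, this temporal term is combined with box-supervised spatial losses when training the VIS model; the paper's exact matching and consistency objectives differ in detail from this surrogate.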

Methods

(Figure: overview of the MaskFreeVIS method.)

Installation

Please see Getting Started with Detectron2 for full usage.

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Install them together at pytorch.org to ensure this. Note: check that your PyTorch version matches the one required by Detectron2 (a quick sanity check is sketched below this list).
  • Detectron2: follow the Detectron2 installation instructions.
  • OpenCV is optional, but needed by the demo and visualization.
  • pip install -r requirements.txt
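
The following is a small, illustrative environment check (not part of the repository) to confirm the versions above line up before building anything:

import torch, torchvision, detectron2

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())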

CUDA kernel for MSDeformAttn

After preparing the required environment, run the following commands to compile the CUDA kernel for MSDeformAttn:

CUDA_HOME must be defined and point to the directory of the installed CUDA toolkit.

cd mask2former/modeling/pixel_decoder/ops
sh make.sh
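
After the build, a quick import check can confirm the compiled extension is usable. This is a sketch: the module path follows Mask2Former's ops layout and is assumed unchanged in this repository; the import fails if the MultiScaleDeformableAttention extension did not build.

# Run from the MaskFreeVIS repository root after compiling the op.
from mask2former.modeling.pixel_decoder.ops.modules import MSDeformAttn
print("MSDeformAttn imported successfully")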

Building on another system

To build on a system that does not have a GPU device but provides the drivers:

TORCH_CUDA_ARCH_LIST='8.0' FORCE_CUDA=1 python setup.py build install

Example conda environment setup

conda create --name maskfreevis python=3.8 -y
conda activate maskfreevis
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python

# under your working directory
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .

cd ..
git clone https://github.com/SysCV/MaskFreeVIS.git
cd MaskFreeVIS
pip install -r requirements.txt
cd mask2former/modeling/pixel_decoder/ops
sh make.sh

Dataset preparation

Please see the document here.

Model Zoo

Video Instance Segmentation (YouTubeVIS)

Using COCO image masks without YTVIS video masks during training:

Config Name  | Backbone | AP   | Download | Training Script | COCO Init Weight
MaskFreeVIS  | R50      | 46.6 | model    | script          | Init
MaskFreeVIS  | R101     | 49.1 | model    | script          | Init
MaskFreeVIS  | Swin-L   | 56.0 | model    | script          | Init

For the two training settings below, which do not use pseudo COCO image masks for joint video training, change to the folder:

cd mfvis_nococo

  1. Only using a COCO-mask-initialized model, without YTVIS video masks during training:

Config Name  | Backbone | AP   | Download | Training Script | COCO Init Weight
MaskFreeVIS  | R50      | 43.8 | model    | script          | Init
MaskFreeVIS  | R101     | 47.3 | model    | script          | Init

  2. Only using a COCO-box-initialized model, without YTVIS video masks during training:

Config Name  | Backbone | AP   | Download | Training Script | COCO Box Init Weight
MaskFreeVIS  | R50      | 42.5 | model    | script          | Init
Please see our scripts folder for the corresponding training commands.

Inference & Evaluation

First, download the provided trained models from the model zoo tables above and put them into the mfvis_models folder:

mkdir mfvis_models

Refer to our scripts folder for more commands.

Example evaluation scripts:

bash scripts/eval_8gpu_mask2former_r50_video.sh
bash scripts/eval_8gpu_mask2former_r101_video.sh
bash scripts/eval_8gpu_mask2former_swinl_video.sh

Results Visualization

Example visualization script:

bash scripts/visual_video.sh

Citation

If you find MaskFreeVIS useful in your research or refer to the provided baseline results, please star ⭐ this repository and consider citing 📝:

@inproceedings{maskfreevis,
    author    = {Ke, Lei and Danelljan, Martin and Ding, Henghui and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
    title     = {Mask-Free Video Instance Segmentation},
    booktitle = {CVPR},
    year      = {2023}
}

Acknowledgments

  • Thanks to BoxInst for its image-based instance segmentation losses.
  • Thanks to Mask2Former and VMT for providing useful inference and evaluation toolkits.
