
DEVA: Tracking Anything with Decoupled Video Segmentation

(title card image)

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee

University of Illinois Urbana-Champaign and Adobe

ICCV 2023

[arXiv] [PDF] [Project Page] [Open in Colab]

Highlights

  1. Provides long-term, open-vocabulary video segmentation with text prompts out of the box.
  2. Fairly easy to integrate your own image model! Wouldn't you or your reviewers be interested in seeing examples where your image model also works well on videos 😏? No fine-tuning is needed!

Note (Sep 12 2023): We have improved automatic video segmentation by not querying the points in segmented regions. We correspondingly increased the number of query points per side to 64 and deprecated the "engulf" mode. The old code can be found in the "legacy_engulf" branch. The new code should run a lot faster and capture smaller objects. The text-prompted mode is still recommended for better results.

Note (Sep 11 2023): We have removed the "pluralize" option as it sometimes behaves unpredictably with GroundingDINO. If needed, please pluralize the prompt yourself.

Abstract

We develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we propose a (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several tasks, most notably in large-vocabulary video panoptic segmentation and open-world video segmentation.
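To make the decoupled design concrete, here is a toy sketch (not the paper's actual implementation; all names are hypothetical) of the fusion idea: masks propagated from earlier frames are matched against new image-level detections by IoU, and unmatched detections start new objects. Masks are represented as sets of pixel coordinates for simplicity.

```python
def iou(a, b):
    """IoU of two masks represented as sets of pixel coordinates."""
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

def fuse(propagated, detected, iou_thresh=0.5):
    """Merge temporally propagated masks with image-level detections.

    propagated: {object_id: mask}, carried over from previous frames.
    detected:   list of masks from the task-specific image model.
    Returns an updated {object_id: mask}; unmatched detections become
    new objects (a toy stand-in for DEVA's segmentation-hypothesis fusion).
    """
    result = {}
    used = set()
    for obj_id, p_mask in propagated.items():
        # Keep the best-overlapping detection for each existing object.
        best, best_iou = None, iou_thresh
        for i, d_mask in enumerate(detected):
            if i in used:
                continue
            score = iou(p_mask, d_mask)
            if score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            result[obj_id] = detected[best]
            used.add(best)
        else:
            result[obj_id] = p_mask  # no supporting detection: keep propagation
    next_id = max(propagated, default=0) + 1
    for i, d_mask in enumerate(detected):
        if i not in used:  # a new object enters the scene
            result[next_id] = d_mask
            next_id += 1
    return result
```

The real system additionally fuses hypotheses bi-directionally across several frames; this sketch only captures the one-step matching intuition.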

Demo Videos

Demo with Grounded Segment Anything (text prompt: "guinea pigs" and "chicken"):

geinua.mp4

Source: https://www.youtube.com/watch?v=FM9SemMfknA

Demo with Grounded Segment Anything (text prompt: "pigs"):

piglets.mp4

Source: https://youtu.be/FbK3SL97zf8

Demo with Grounded Segment Anything (text prompt: "capybara"):

capybara_ann.mp4

Source: https://youtu.be/couz1CrlTdQ

Demo with Segment Anything (automatic points-in-grid prompting); the DEVA result overlaid on the video is followed by the original video:

soapbox_joined.mp4

Source: DAVIS 2017 validation set "soapbox"

Demo with Segment Anything on an out-of-domain example; the DEVA result overlaid on the video is followed by the original video:

green_pepper_joined.mp4

Source: https://youtu.be/FQQaSyH9hZI

Installation

Tested on Ubuntu only. For installation on Windows WSL2, refer to #20 (thanks @21pl).

Prerequisite:

  • Python 3.7+
  • PyTorch 1.12+ and corresponding torchvision

Clone our repository:

git clone https://github.com/hkchengrex/Tracking-Anything-with-DEVA.git

Install with pip:

cd Tracking-Anything-with-DEVA
pip install -e .

(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)

Download the pretrained models:

bash scripts/download_models.sh

Required for the text-prompted/automatic demo:

Install our fork of Grounded-Segment-Anything. Follow its instructions.

Grounding DINO installation might fail silently. To check, run python -c "from groundingdino.util.inference import Model as GroundingDINOModel". If you get a warning about running in CPU-only mode, make sure CUDA_HOME was set during Grounding DINO installation.

(Optional) For fast integer program solving in the semi-online setting:

Get your gurobi license, which is free for academic use. If a license is not found, we fall back to PuLP, which is slower and not rigorously tested by us. All experiments were conducted with gurobi.
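The fallback can be sketched as a simple preference order (names hypothetical; the actual selection logic lives in the DEVA codebase):

```python
def pick_solver(is_available):
    """Return the first usable integer-program backend.

    is_available: callable reporting whether a package can be imported
    (and, for gurobi, whether a license is found).
    Prefers gurobi; falls back to PuLP.
    """
    for name in ("gurobipy", "pulp"):
        if is_available(name):
            return name
    raise RuntimeError("no integer-program solver available")
```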

Quick Start

DEMO.md contains more details on the input arguments and tips on speeding up inference. You can always look at deva/inference/eval_args.py and deva/ext/ext_eval_args.py for a full list of arguments.

With gradio:

python demo/demo_gradio.py

Then visit the link printed in the terminal. If running on a remote server, try port forwarding.
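For example, assuming the default gradio port (7860), local port forwarding over SSH might look like:

```shell
# Forward local port 7860 to the same port on the remote machine,
# then open http://localhost:7860 in a local browser.
ssh -L 7860:localhost:7860 user@remote-server
```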

We have prepared an example in example/vipseg/12_1mWNahzcsAc (a clip from the VIPSeg dataset). The following two scripts segment the example clip using either Grounded Segment Anything with text prompts or SAM with automatic (points in grid) prompting.

Script (text-prompted):

python demo/demo_with_text.py --chunk_size 4 \
--img_path ./example/vipseg/images/12_1mWNahzcsAc \
--amp --temporal_setting semionline \
--size 480 \
--output ./example/output --prompt person.hat.horse

Script (automatic):

python demo/demo_automatic.py --chunk_size 4 \
--img_path ./example/vipseg/images/12_1mWNahzcsAc \
--amp --temporal_setting semionline \
--size 480 \
--output ./example/output

Training and Evaluation

  1. Running DEVA with your own detection model.
  2. Running DEVA with detections to reproduce the benchmark results.
  3. Training the DEVA model.

Limitations

  • On closed-set data, DEVA most likely does not work as well as end-to-end approaches. Joint training is (for now) still the better choice when you have enough target-domain data.
  • Positive detections are amplified temporally by propagation, so a detector with a lower false positive rate (i.e., a higher threshold) helps.
  • If new objects constantly enter and leave the scene (e.g., in driving footage), many objects are kept in the memory bank, which unfortunately increases the false positive rate. Decreasing max_missed_detection_count may help, since objects are then deleted from memory more eagerly.
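For instance, the memory deletion threshold mentioned above can be lowered on the command line (the flag name here follows the parameter name in the text; check deva/inference/eval_args.py for the exact argument):

```shell
python demo/demo_with_text.py --chunk_size 4 \
  --img_path ./example/vipseg/images/12_1mWNahzcsAc \
  --amp --temporal_setting semionline \
  --size 480 --output ./example/output \
  --prompt person.hat.horse \
  --max_missed_detection_count 5
```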

Citation

@inproceedings{cheng2023tracking,
  title={Tracking Anything with Decoupled Video Segmentation},
  author={Cheng, Ho Kei and Oh, Seoung Wug and Price, Brian and Schwing, Alexander and Lee, Joon-Young},
  booktitle={ICCV},
  year={2023}
}

References

The demo would not be possible without ❤️ from the community:

Grounded Segment Anything: https://github.com/IDEA-Research/Grounded-Segment-Anything

Segment Anything: https://github.com/facebookresearch/segment-anything

XMem: https://github.com/hkchengrex/XMem

Title card generated with OpenPano: https://github.com/ppwwyyxx/OpenPano
