Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation (CVPR-2022, oral)

Paper, Slides, Poster, Video

Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy.

We introduce Video K-Net, a simple, strong, and unified framework for fully end-to-end dense video segmentation.

The method is built upon K-Net, a method of unifying image segmentation via a group of learnable kernels.

This project contains the training and testing code of Video K-Net for VPS (Video Panoptic Segmentation), VSS (Video Semantic Segmentation), and VIS (Video Instance Segmentation).

To the best of our knowledge, our Video K-Net is the first open-sourced method that supports three different video segmentation tasks (VIS, VPS, VSS) for Video Scene Understanding.

News! Video K-Net is acknowledged as a strong baseline for the CVPR-2023 workshop "The 2nd Pixel-level Video Understanding in the Wild".

News! Video K-Net also supports the VIP-Seg dataset (CVPR-2022), where it achieves a new state-of-the-art result.

Environment and Dataset Preparation

Our codebase is based on MMDetection and MMSegmentation. Parts of the code are borrowed from MMTracking and UniTrack.

MIM >= 0.1.1
MMCV-full >= v1.3.8
MMDetection == v2.18.0
timm
scipy
panopticapi
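
The version constraints above can be checked before training with a short script. This is a minimal sketch and not part of the repository: it assumes the PyPI distribution names `openmim`, `mmcv-full`, and `mmdet` for MIM, MMCV-full, and MMDetection, the helper names are hypothetical, and only the numeric release part of each version is compared.

```python
import re
from importlib import metadata

# (distribution name, operator, required version).
# The distribution names are assumptions about how the packages are published.
REQUIREMENTS = [
    ("openmim", ">=", "0.1.1"),
    ("mmcv-full", ">=", "1.3.8"),
    ("mmdet", "==", "2.18.0"),
    ("timm", None, None),
    ("scipy", None, None),
    ("panopticapi", None, None),
]


def parse_version(text):
    """Turn 'v1.3.8' or '1.3.8+cu111' into a tuple of ints such as (1, 3, 8)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies(installed, operator, required):
    """Check a single '>=' or '==' constraint on numeric release tuples."""
    have, want = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so that '2.18' equals '2.18.0'.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    if operator == ">=":
        return have >= want
    if operator == "==":
        return have == want
    raise ValueError(f"unsupported operator: {operator!r}")


def check_requirements(requirements=REQUIREMENTS):
    """Return a {distribution name: status string} report."""
    report = {}
    for name, operator, required in requirements:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if operator is None or satisfies(installed, operator, required):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"found {installed}, need {operator} {required}"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script prints one status line per package, so a version mismatch in MMDetection (which is pinned exactly) shows up before a training job is submitted.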

See DATASET.md for dataset preparation.

The knet folder contains the Video K-Net for VPS.

The knet_vis folder contains the Video K-Net for VIS.

Pretrained CKPTs and Trained Models

We provide the pretrained models for VPS and VIS.

Baidu Yun Link: here Code:i034

One Drive Link: here

The pretrained models are used to initialize Video K-Net training.

The trained models are also provided for testing and experimentation.

[VPS] KITTI-STEP

First pretrain K-Net on the Cityscapes-STEP dataset. As shown in the original STEP paper (Appendix) and in our own experiments, this step is very important for segmentation performance. You can also use our trained model for verification.

Cityscapes-STEP follows the format of STEP: 17 stuff classes and 2 thing classes.
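
The 17 + 2 split can be written down explicitly. The sketch below assumes the standard 19 Cityscapes training classes in their usual order and takes person and car as the two thing classes, as in the STEP annotations. The variable and function names are illustrative only; the repository's configs remain the reference.

```python
# The 19 Cityscapes training classes in their usual train-ID order.
CITYSCAPES_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bicycle",
]

# Assumption: only these two classes carry instance/track IDs in STEP.
THING_CLASSES = ("person", "car")

# Every other class is treated as stuff (semantic label only).
STUFF_CLASSES = tuple(
    name for name in CITYSCAPES_CLASSES if name not in THING_CLASSES
)

# Train IDs of the thing classes, derived from the list order above.
THING_IDS = tuple(CITYSCAPES_CLASSES.index(name) for name in THING_CLASSES)


def is_thing(class_id):
    """Return True if the train ID belongs to a class with instance IDs."""
    return class_id in THING_IDS


if __name__ == "__main__":
    print(f"{len(STUFF_CLASSES)} stuff classes, {len(THING_CLASSES)} thing classes")
    print("thing train IDs:", THING_IDS)
```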

# train cityscapes step panoptic segmentation models
sh ./tools/slurm_train.sh $PARTITION knet_step configs/det/knet_cityscapes_step/knet_s3_r50_fpn.py $WORK_DIR --no-validate

Then train the Video K-Net on KITTI-STEP. We provide the K-Net models pretrained on Cityscapes for this step.

For slurm users:

# train Video K-Net on KITTI-step using R-50
GPUS=8 sh ./tools/slurm_train.sh $PARTITION video_knet_step configs/det/video_knet_kitti_step/video_knet_s3_r50_rpn_1x_kitti_step_sigmoid_stride2_mask_embed_link_ffn_joint_train.py $WORK_DIR --no-validate --load-from /path_to_knet_step_city_r50

# train Video K-Net on KITTI-step using Swin-base
GPUS=16 GPUS_PER_NODE=8 sh ./tools/slurm_train.sh $PARTITION video_knet_step configs/det/video_knet_kitti_step/video_knet_s3_swinb_rpn_1x_kitti_step_sigmoid_stride2_mask_embed_link_ffn_joint_train.py $WORK_DIR --no-validate --load-from /path_to_knet_step_city_r50

Our models are trained with two V100 machines.

For local machines:

# train Video K-Net on KITTI-step with 8 GPUs
sh ./tools/dist_train.sh video_knet_step configs/det/video_knet_kitti_step/video_knet_s3_r50_rpn_1x_kitti_step_sigmoid_stride2_mask_embed_link_ffn_joint_train.py 8 $WORK_DIR --no-validate

Testing and Demo

We provide both VPQ and STQ metrics to evaluate VPS models.

# test locally
sh ./tools/dist_step_test.sh configs/det/knet_cityscapes_step/knet_s3_r50_fpn.py $MODEL_DIR

We also dump the colored images for debugging.

# eval STEP STQ
python tools/eval_dstq_step.py result_path gt_path

# eval STEP VPQ
python tools/eval_dvpq_step.py result_path gt_path
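
For reference, the STEP paper defines STQ as the geometric mean of Association Quality (AQ) and Segmentation Quality (SQ). The helper below is only a sketch of that final combination step, with a hypothetical function name; computing AQ and SQ from predictions is left to tools/eval_dstq_step.py.

```python
import math


def stq_from_parts(aq, sq):
    """Combine Association Quality and Segmentation Quality into STQ.

    Both inputs are expected as fractions in [0, 1].
    """
    if not (0.0 <= aq <= 1.0 and 0.0 <= sq <= 1.0):
        raise ValueError("aq and sq must lie in [0, 1]")
    # Geometric mean of the two terms.
    return math.sqrt(aq * sq)


if __name__ == "__main__":
    print(stq_from_parts(0.64, 0.36))
```

Because the mean is geometric, a model with strong segmentation but weak association (or the reverse) still receives a low STQ.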

Toy Video K-Net

As shown in the paper, we also provide a toy Video K-Net in knet/video/knet_quansi_dense_embed_fc_toy_exp.py. You can use it with the K-Net pre-trained on image-level KITTI-STEP without tracking.

[VIS] YouTube-VIS-2019

First download the pre-trained image K-Net instance segmentation models. All the models are pretrained on COCO, which is common practice. You can also pretrain them yourself; we provide the config for pretraining.

For slurm users:

# train K-Net instance segmentation models on COCO using R-50
GPUS=8 sh ./tools/slurm_train.sh $PARTITION knet_instance configs/det/coco/knet_s3_r50_fpn_ms-3x_coco.py $WORK_DIR

Then train the Video K-Net in a clip-wise manner.

# train Video K-Net VIS models using R-50
GPUS=8 sh ./tools/slurm_train.sh $PARTITION video_knet_vis configs/video_knet_vis/video_knet_vis/knet_track_r50_1x_youtubevis.py $WORK_DIR --load-from /path_to_knet_instance_coco

To evaluate Video K-Net on VIS, dump the prediction results for submission to the CodaLab server.

# test Video K-Net VIS models using R-50
GPUS=8 sh tools_vis/dist_test_whole_video.sh $PARTITION video_knet_vis configs/video_knet_vis/video_knet_vis/knet_track_r50_1x_youtubevis.py $WORK_DIR --format-only

The result JSON file is dumped into the root directory of this codebase.

[VPS] VIP-Seg

First download the pre-trained image K-Net panoptic segmentation models. All the models are pretrained on COCO, which is a common step following VIP-Seg. You can also pretrain them yourself; we provide the config for pretraining.

# train K-Net on COCO Panoptic Segmentation
GPUS=8 sh ./tools/slurm_train.sh $PARTITION knet_coco configs/det/coco/knet_s3_r50_fpn_ms-3x_coco-panoptic.py $WORK_DIR

Train the Video K-Net on the VIP-Seg dataset.

# train Video K-Net on VIP-Seg
GPUS=8 sh ./tools/slurm_train.sh $PARTITION video_knet_vis configs/det/video_knet_vipseg/video_knet_s3_r50_rpn_vipseg_mask_embed_link_ffn_joint_train.py $WORK_DIR --load-from /path/knet_coco_pretrained_r50

Test the Video K-Net on the VIP-Seg validation set.

# test locally on VIP-Seg
sh ./tools/dist_step_test.sh configs/det/video_knet_vipseg/video_knet_s3_r50_rpn_vipseg_mask_embed_link_ffn_joint_train.py $MODEL_DIR

We also dump the colored images for debugging.

# eval VIP-Seg STQ
python tools/eval_dstq_vipseg.py result_path gt_path

# eval VIP-Seg VPQ
python tools/eval_dvpq_vipseg.py result_path gt_path

Visualization Results

Results on KITTI-STEP Dataset

Results on VIP-Seg Dataset

Results on YouTube-VIS Dataset

Short-term segmentation and tracking results on the Cityscapes VPS dataset.

Images (left), Video K-Net (middle), Ground Truth (right)

Long-term segmentation and tracking results on the STEP dataset.

Related Project and Acknowledgement

NeurIPS-2021, K-Net: Unified Segmentation: Our image baseline (https://github.com/ZwwWayne/K-Net)

ECCV-2022, PolyphonicFormer: A Unified Framework For Panoptic Segmentation + Depth Estimation (winner of ICCV-2021 BMTT workshop) (https://github.com/HarborYuan/PolyphonicFormer)

Citing Video K-Net 🙏

If you use our codebase in your research or for the CVPR-2023 pixel-level video workshop, please use the following BibTeX entries.

@inproceedings{li2022videoknet,
  title={Video k-net: A simple, strong, and unified baseline for video segmentation},
  author={Li, Xiangtai and Zhang, Wenwei and Pang, Jiangmiao and Chen, Kai and Cheng, Guangliang and Tong, Yunhai and Loy, Chen Change},
  booktitle={CVPR},
  year={2022}
}

@article{zhang2021k,
  title={K-net: Towards unified image segmentation},
  author={Zhang, Wenwei and Pang, Jiangmiao and Chen, Kai and Loy, Chen Change},
  journal={NeurIPS},
  year={2021}
}
