MAGVIT: Masked Generative Video Transformer
[Paper] | [Project Page] | [Colab]
Official code and models for the CVPR 2023 paper:
MAGVIT: Masked Generative Video Transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
CVPR 2023
Summary
We introduce MAGVIT, a single model that tackles a variety of video synthesis tasks, and demonstrate its quality, efficiency, and flexibility.
If you find this code useful in your research, please cite:
@inproceedings{yu2023magvit,
title={{MAGVIT}: Masked generative video transformer},
author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
Disclaimers
Please note that this is not an officially supported Google product.
Checkpoints are based on training with publicly available datasets. Some datasets carry usage limitations, including non-commercial use limitations. Please review the terms and conditions made available by third parties before using the models and datasets provided.
Installation
A conda environment file is provided for running with GPUs. CUDA 11 and cuDNN 8.6 are required for JAX. This VM Image has been tested.
conda env create -f environment.yaml
conda activate magvit
Alternatively, you can install the dependencies with pip:
pip install -r requirements.txt
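After either installation route, a quick check can confirm that JAX imports and show which devices it sees. The snippet below is a minimal sketch and not part of this repository: the helper name `describe_jax_devices` is hypothetical, and it relies only on `jax.devices()` and the `platform` attribute of each returned device.

```python
import importlib.util


def describe_jax_devices():
    """Return a list of (platform, device string) pairs, or None if JAX is missing.

    Hypothetical helper for illustration; not provided by the MAGVIT codebase.
    """
    if importlib.util.find_spec("jax") is None:
        return None  # JAX is not installed in the active environment
    import jax

    # jax.devices() lists the devices of the default backend.
    return [(device.platform, str(device)) for device in jax.devices()]


if __name__ == "__main__":
    devices = describe_jax_devices()
    if devices is None:
        print("JAX is not installed; activate the magvit environment first.")
    else:
        for platform, name in devices:
            print(f"{platform}: {name}")
```

If only CPU devices are listed on a GPU machine, the installed CUDA or cuDNN version is a likely place to look first.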
Pretrained models
Model weights and loading instructions are coming soon.
MAGVIT 3D-VQ models
Model | Size | Input video | Output tokens | Codebook size | Dataset |
---|---|---|---|---|---|
3D-VQ | B | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
3D-VQ | L | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
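
The shapes in the table imply how strongly each tokenizer compresses a clip. The sketch below works through that arithmetic, assuming the output shape is read as a time x height x width grid of tokens and that each token is one index into the codebook. `TokenizerSpec` and `compression_summary` are hypothetical names used only for this illustration.

```python
import math
from typing import NamedTuple, Tuple


class TokenizerSpec(NamedTuple):
    """One row of the 3D-VQ table (hypothetical container, for illustration)."""
    frames: int
    height: int
    width: int
    token_shape: Tuple[int, int, int]  # assumed order: (time, height, width)
    codebook_size: int


def compression_summary(spec: TokenizerSpec) -> dict:
    """Derive downsampling factors and token counts from a table row."""
    t, h, w = spec.token_shape
    num_tokens = t * h * w
    bits_per_token = math.log2(spec.codebook_size)
    return {
        "temporal_factor": spec.frames // t,
        "spatial_factor": (spec.height // h, spec.width // w),
        "num_tokens": num_tokens,
        "bits_per_token": bits_per_token,
        "total_bits": num_tokens * bits_per_token,
    }


# Rows taken from the table: BAIR uses 64x64 input, the other datasets 128x128.
bair = TokenizerSpec(16, 64, 64, (4, 16, 16), 1024)
ucf = TokenizerSpec(16, 128, 128, (4, 16, 16), 1024)

for name, spec in (("BAIR Robot Pushing", bair), ("UCF-101", ucf)):
    print(name, compression_summary(spec))
```

Under this reading, every listed model represents a 16-frame clip with 4 x 16 x 16 = 1024 tokens of 10 bits each. The 64x64 models downsample 4x in time and 4x in each spatial dimension, while the 128x128 models downsample 4x in time and 8x in each spatial dimension.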
MAGVIT transformers
Each transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.
Model | Task | Size | Dataset | FVD |
---|---|---|---|---|
Transformer | Class-conditional | B | UCF-101 | 159 |
Transformer | Class-conditional | L | UCF-101 | 76 |
Transformer | Frame prediction | B | BAIR Robot Pushing | 76 (48) |
Transformer | Frame prediction | L | BAIR Robot Pushing | 62 (31) |
Transformer | Frame prediction (5) | B | Kinetics-600 | 24.5 |
Transformer | Frame prediction (5) | L | Kinetics-600 | 9.9 |
Transformer | Multi-task-8 | B | BAIR Robot Pushing | 32.8 |
Transformer | Multi-task-8 | L | BAIR Robot Pushing | 22.8 |
Transformer | Multi-task-10 | B | Something-Something-v2 | 43.4 |
Transformer | Multi-task-10 | L | Something-Something-v2 | 27.3 |
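
Because each transformer only works with the 3D-VQ tokenizer of the same dataset and model size, it can help to check the pairing before loading anything. Loading instructions are not published yet, so the sketch below is an assumption throughout: `Checkpoint` and `assert_compatible` are hypothetical names, and the real loading code may identify models differently.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    """Hypothetical descriptor of a released model; not the repository's API."""
    dataset: str  # e.g. "UCF-101", as written in the tables
    size: str     # "B" or "L"


def assert_compatible(transformer: Checkpoint, tokenizer: Checkpoint) -> None:
    """Raise ValueError unless dataset and model size both match."""
    if transformer.dataset != tokenizer.dataset:
        raise ValueError(
            f"Dataset mismatch: transformer trained on {transformer.dataset}, "
            f"tokenizer trained on {tokenizer.dataset}."
        )
    if transformer.size != tokenizer.size:
        raise ValueError(
            f"Size mismatch: transformer is {transformer.size}, "
            f"tokenizer is {tokenizer.size}."
        )


# A matching pair passes silently.
assert_compatible(Checkpoint("UCF-101", "L"), Checkpoint("UCF-101", "L"))

# A size mismatch is rejected before any weights would be loaded.
try:
    assert_compatible(Checkpoint("UCF-101", "L"), Checkpoint("UCF-101", "B"))
except ValueError as err:
    print(err)
```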