Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning (CVPR 2022)
MMVID: Project | arXiv | PDF | Dataset
This repo contains the training and testing code, pretrained models, and data for MMVID.
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Ligong Han, Jian Ren, Hsin-Ying Lee, Francesco Barbieri, Kyle Olszewski, Shervin Minaee, Dimitris Metaxas, Sergey Tulyakov
Snap Inc., Rutgers University
CVPR 2022
MMVID Code
CLIP model
Download OpenAI's pretrained CLIP model and place it under ./ (or any other directory, as long as it is consistent with the --openai_clip_model_path argument):
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
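If the checkpoint is stored somewhere other than ./, pass its location explicitly. A minimal sketch, assuming a generic entry point (the script name below is a placeholder, not a command taken from this repo):

```bash
# Illustrative only: train.py is a placeholder entry point; the flag is from this README.
python train.py --openai_clip_model_path ./ViT-B-32.pt
```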
VQGAN
Code for finetuning VQGAN models is provided in this repo.
Multimodal VoxCeleb
For testing, please download the pretrained models and change the path passed to --dalle_path
in the scripts.
For quantitative evaluation, append --eval_mode eval
to each testing command. The output log directory can be changed by appending --name_suffix _fvd
to add a suffix (example here).
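As a sketch of how these flags combine (the entry point and checkpoint path below are placeholders, not verbatim from the repo):

```bash
# Hypothetical testing command; only the flags are taken from this README.
# --dalle_path       : path to a downloaded pretrained checkpoint
# --eval_mode eval   : run quantitative evaluation
# --name_suffix _fvd : append a suffix to the output log directory
python test.py --dalle_path ./pretrained/text_to_video.pt --eval_mode eval --name_suffix _fvd
```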
Text-to-Video
Training:
bash scripts/mmvoxceleb/text_to_video/train.sh
Testing:
bash scripts/mmvoxceleb/text_to_video/test.sh
For Quantitative Evaluation (FVD and PRD):
bash scripts/mmvoxceleb/text_to_video/evaluation.sh
Text Augmentation
Text augmentation can improve training. To use a pretrained RoBERTa model, append --fixed_language_model roberta-large
to the training/testing command. Note that this feature is experimental and not very robust.
To enable text dropout, append --drop_sentence
to the training command. Text dropout is also compatible with RoBERTa. We observed that text dropout generally improves diversity in the generated videos.
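For example, both options can be combined in one training run (the entry point below is a placeholder; only the flags are from this README):

```bash
# Hypothetical training command combining RoBERTa text encoding with text dropout.
python train.py --fixed_language_model roberta-large --drop_sentence
```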
Training:
bash scripts/mmvoxceleb/text_augement/train.sh
Testing:
bash scripts/mmvoxceleb/text_augement/test.sh
Text and Mask
Training:
bash scripts/mmvoxceleb/text_and_mask/train.sh
Testing:
bash scripts/mmvoxceleb/text_and_mask/test.sh
Text and Drawing
Training:
bash scripts/mmvoxceleb/text_and_drawing/train.sh
Testing:
bash scripts/mmvoxceleb/text_and_drawing/test.sh
Drawing and Mask
Training:
bash scripts/mmvoxceleb/drawing_and_mask/train.sh
Testing:
bash scripts/mmvoxceleb/drawing_and_mask/test.sh
Image and Mask
Training:
bash scripts/mmvoxceleb/image_and_mask/train.sh
Testing:
bash scripts/mmvoxceleb/image_and_mask/test.sh
Text and Partial Image
Training:
bash scripts/mmvoxceleb/text_and_partial_image/train.sh
Testing:
bash scripts/mmvoxceleb/text_and_partial_image/test.sh
Image and Video
Training:
bash scripts/mmvoxceleb/image_and_video/train.sh
Testing:
bash scripts/mmvoxceleb/image_and_video/test.sh
Pretrained Models
Pretrained models are provided here.
Multimodal VoxCeleb
Model | Weight | FVD
---|---|---
VQGAN (vae) | ckpt | -
VQGAN (cvae, for image conditioning) | ckpt | -
Text-to-Video | pt | 59.46
Text-to-Video (ARTV) | pt | 70.95
Text and Mask | pt | -
Text and Drawing | pt | -
Drawing and Mask | pt | -
Image and Mask | pt | -
Text and Partial Image | pt | -
Image and Video | pt | -
Text Augmentation | pt | -
Multimodal VoxCeleb Dataset
The Multimodal VoxCeleb dataset has a total of 19,522 videos covering 3,437 different interview situations from 453 people. Please see mm_vox_celeb/README.md for details on how to prepare the dataset. Preprocessed data is also available here.
Acknowledgement
This code is heavily based on DALLE-PyTorch and uses CLIP, Taming Transformers, Precision and Recall for Distributions (PRD), Fréchet Video Distance (FVD), Facenet-PyTorch, Face Parsing, and Unpaired Portrait Drawing.
The authors thank everyone who makes their code and models available.
Citation
If our code, data, or models help your work, please cite our paper:
@inproceedings{han2022show,
title={Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning},
author={Han, Ligong and Ren, Jian and Lee, Hsin-Ying and Barbieri, Francesco and Olszewski, Kyle and Minaee, Shervin and Metaxas, Dimitris and Tulyakov, Sergey},
booktitle={CVPR},
year={2022}
}