MoCoGAN-HD

Project | OpenReview | arXiv | Talk | Slides

(AFHQ, VoxCeleb)

Pytorch implementation of our method for high-resolution (e.g. 1024x1024) and cross-domain video synthesis.
A Good Image Generator Is What You Need for High-Resolution Video Synthesis
Yu Tian¹, Jian Ren², Menglei Chai², Kyle Olszewski², Xi Peng³, Dimitris N. Metaxas¹, Sergey Tulyakov²
¹Rutgers Univeristy, ²Snap Inc., ³University of Delaware
In ICLR 2021, Spotlight.

Pre-trained Image Generator & Video Datasets

In-domain Video Synthesis

UCF-101: image generator, video data, motion generator
FaceForensics: image generator, video data, motion generator
Sky-Timelapse: image generator, video data, motion generator

Cross-domain Video Synthesis

(FFHQ, VoxCeleb): FFHQ image generator, VoxCeleb, motion generator
(AFHQ, VoxCeleb): AFHQ image generator, VoxCeleb, motion generator
(Anime, VoxCeleb): Anime image generator, VoxCeleb, motion generator
(FFHQ-1024, VoxCeleb): FFHQ-1024 image generator, VoxCeleb, motion generator
(LSUN-Church, TLVDB): LSUN-Church image generator, TLVDB

Calculated pca stats are saved here.

Training

Organise the video dataset as follows:

Video dataset
|-- video1
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- video2
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- video3
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- ...

In-domain Video Synthesis

UCF-101

Collect the PCA components from a pre-trained image generator.

python get_stats_pca.py --batchSize 4000 \
  --save_pca_path pca_stats/ucf_101 \
  --pca_iterations 250 \
  --latent_dimension 512 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --style_gan_size 256 \
  --gpu 0

Train the model

python -W ignore train.py --name ucf_101 \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --dataroot /path/to/ucf_101 \
  --checkpoints_dir checkpoints/ucf_101 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 100 \

Inference

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch the_epoch_for_testing (should >= 0) \
  --results results/ucf_101 \
  --num_test_videos 10 \

FaceForensics

Collect the PCA components from a pre-trained image generator.

sh script/faceforensics/run_get_stats_pca.sh

Train the model

sh script/faceforensics/run_train.sh

Inference

sh script/faceforensics/run_evaluate.sh

Sky-Timelapse

Collect the PCA components from a pre-trained image generator.

sh script/sky_timelapse/run_get_stats_pca.sh

Train the model

sh script/sky_timelapse/run_train.sh

Inference

sh script/sky_timelapse/run_evaluate.sh

Cross-domain Video Synthesis

(FFHQ, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

python get_stats_pca.py --batchSize 4000 \
  --save_pca_path pca_stats/ffhq_256 \
  --pca_iterations 250 \
  --latent_dimension 512 \
  --img_g_weights /path/to/ffhq_image_generator \
  --style_gan_size 256 \
  --gpu 0

Train the model

python -W ignore train.py --name ffhq_256-voxel \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --dataroot /path/to/voxel_dataset \
  --checkpoints_dir checkpoints \
  --img_g_weights /path/to/ffhq_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 25 \
  --cross_domain \

Inference

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch the_epoch_for_testing (should >= 0) \
  --results results/ffhq_256 \
  --num_test_videos 10 \

(FFHQ-1024, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/ffhq-vox/run_get_stats_pca_1024.sh

Train the model

sh script/ffhq-vox/run_train_1024.sh

Inference

sh script/ffhq-vox/run_evaluate_1024.sh

(AFHQ, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/afhq-vox/run_get_stats_pca.sh

Train the model

sh script/afhq-vox/run_train.sh

Inference

sh script/afhq-vox/run_evaluate.sh

(Anime, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/anime-vox/run_get_stats_pca.sh

Train the model

sh script/anime-vox/run_train.sh

Inference

sh script/anime-vox/run_evaluate.sh

(LSUN-Church, TLVDB)

Collect the PCA components from a pre-trained image generator.

sh script/lsun_church-tlvdb/run_get_stats_pca.sh

Train the model

sh script/lsun_church-tlvdb/run_train.sh

Inference

sh script/lsun_church-tlvdb/run_evaluate.sh

Fine-tuning

If you wish to resume interupted training or fine-tune a pre-trained model, run (use UCF-101 as an example):

python -W ignore train.py --name ucf_101 \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --dataroot /path/to/ucf_101 \
  --checkpoints_dir checkpoints \
  --img_g_weights /path/to/ucf_101_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 100 \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0

Training Control With Options

--w_residual controls the step of motion residual, default value is 0.2, we recommand <= 0.5
--n_pca # of PCA basis, used in the motion residual calculation, default value is 384 (out of 512 dim of StyleGAN2 w space), we recommand >= 256
--q_len size of queue to save logits used in constrastive loss, default value is 4,096
--video_frame_size spatial size of video frames for training, all synthesized video clips will be down-sampled to this size before feeding to the video discriminator, default value is 128, larger size may lead to better motion modeling
--cross_domain activate for cross-domain video synthesis, default value is False
--w_match weight for feature matching loss, default value is 1.0, large value improves content matching

Long Sequence Generation

LSTM Unrolling

In inference, you can generate long sequence by LSTM unrolling with --n_frames_G

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0 \
  --n_frames_G 32

Interpolation

In inference, you can generate long sequence by interpolation with --interpolation

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0 \
  --interpolation

Examples of Generated Videos

UCF-101

FaceForensics

Sky Timelapse

(FFHQ, VoxCeleb)

(FFHQ-1024, VoxCeleb)

(Anime, VoxCeleb)

(LSUN-Church, TLVDB)

Citation

If you use the code for your work, please cite our paper.

@inproceedings{
tian2021a,
title={A Good Image Generator Is What You Need for High-Resolution Video Synthesis},
author={Yu Tian and Jian Ren and Menglei Chai and Kyle Olszewski and Xi Peng and Dimitris N. Metaxas and Sergey Tulyakov},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=6puCSjH3hwA}
}

Acknowledgments

This code borrows StyleGAN2 Image Generator, BigGAN Discriminator, PatchGAN Discriminator.

snap-research/MoCoGAN-HD

snap-research

Reviews

Repository Details

MoCoGAN-HD

Project | OpenReview | arXiv | Talk | Slides

(AFHQ, VoxCeleb)

Pre-trained Image Generator & Video Datasets

In-domain Video Synthesis

Cross-domain Video Synthesis

Training

In-domain Video Synthesis

UCF-101

FaceForensics

Sky-Timelapse

Cross-domain Video Synthesis

(FFHQ, VoxCeleb)

(FFHQ-1024, VoxCeleb)

(AFHQ, VoxCeleb)

(Anime, VoxCeleb)

(LSUN-Church, TLVDB)

Fine-tuning

Training Control With Options

Long Sequence Generation

LSTM Unrolling

Interpolation

Examples of Generated Videos

UCF-101

FaceForensics

Sky Timelapse

(FFHQ, VoxCeleb)

(FFHQ-1024, VoxCeleb)

(Anime, VoxCeleb)

(LSUN-Church, TLVDB)

Citation

Acknowledgments

More Repositories