
MM-Diffusion (CVPR 2023)

This is the official PyTorch implementation of the paper MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation, accepted by CVPR 2023.

Introduction

We propose the first joint audio-video generation framework, named MM-Diffusion, that brings engaging watching and listening experiences simultaneously, moving toward high-quality realistic videos. MM-Diffusion consists of a sequential multi-modal U-Net: two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise.
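
Conceptually, each reverse-diffusion step denoises both modalities at once, with each subnet conditioning on the other's current state. Below is a minimal sketch of that coupled step; the linear layers stand in for the actual U-Net subnets and the schedule values are placeholders, so none of this is the repo's real code:

# Illustrative sketch of coupled audio-video denoising (hypothetical shapes and
# modules; the real model uses a sequential multi-modal U-Net with
# random-shift cross-modal attention -- see the paper and this repo's code).
import torch
import torch.nn as nn

class ToyCoupledDenoiser(nn.Module):
    def __init__(self, v_dim=64, a_dim=32):
        super().__init__()
        self.video_net = nn.Linear(v_dim + a_dim, v_dim)  # stands in for the video subnet
        self.audio_net = nn.Linear(a_dim + v_dim, a_dim)  # stands in for the audio subnet

    def forward(self, v_t, a_t):
        # Each subnet sees its own modality plus the other one, so the
        # predicted noise keeps the two streams aligned.
        eps_v = self.video_net(torch.cat([v_t, a_t], dim=-1))
        eps_a = self.audio_net(torch.cat([a_t, v_t], dim=-1))
        return eps_v, eps_a

# One DDPM-style reverse step starting from pure Gaussian noise (toy schedule).
model = ToyCoupledDenoiser()
v, a = torch.randn(1, 64), torch.randn(1, 32)   # x_T for both modalities
alpha, alpha_bar = 0.99, 0.5                    # placeholder schedule values
eps_v, eps_a = model(v, a)
v = (v - (1 - alpha) / (1 - alpha_bar) ** 0.5 * eps_v) / alpha ** 0.5
a = (a - (1 - alpha) / (1 - alpha_bar) ** 0.5 * eps_a) / alpha ** 0.5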

Overview

Visual

The generated audio-video examples on Landscape: landscape.mp4

The generated audio-video examples on AIST++: aist++.mp4

The generated audio-video examples on AudioSet: audioset.mp4

Requirements and dependencies

  • python 3.8 (Anaconda is recommended)
  • pytorch >= 1.11.0

Clone the repository and set up the conda environment:

git clone https://github.com/researchmm/MM-Diffusion.git
cd MM-Diffusion

conda create -n mmdiffusion python=3.8
conda activate mmdiffusion
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch-nightly -c nvidia
conda install mpi4py
pip install -r requirements.txt
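
A quick sanity check for the versions this README asks for, run inside the mmdiffusion environment (a sketch; adjust to your setup):

# Verify pytorch >= 1.11.0, CUDA, and mpi4py are usable.
import torch

major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (1, 11), f"need pytorch >= 1.11.0, got {torch.__version__}"
print("CUDA available:", torch.cuda.is_available())

from mpi4py import MPI  # the training/sampling scripts launch workers via MPI
print("MPI world size:", MPI.COMM_WORLD.Get_size())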

Models

Pre-trained models can be downloaded from Google Drive and Baidu Cloud. A quick way to inspect a downloaded checkpoint is sketched after the list.

  • Landscape.pt: trained on the Landscape dataset to generate audio-video pairs.

  • Landscape_SR.pt: trained on the Landscape dataset to upsample frames from resolution 64x64 to 256x256.

  • AIST++.pt: trained on the AIST++ dataset to generate audio-video pairs.

  • AIST++_SR.pt: trained on the AIST++ dataset to upsample frames from resolution 64x64 to 256x256.

  • guided-diffusion_64_256_upsampler.pt: from guided-diffusion, used to initialize the image SR model.

  • i3d_pretrained_400.pt: model for evaluating videos (FVD and KVD). Manually download it to ~/.cache/mmdiffusion/ if the automatic download fails.

  • AudioCLIP-Full-Training.pt: model for evaluating audio (FAD). Manually download it to ~/.cache/mmdiffusion/ if the automatic download fails.
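
Building the actual models requires this repo's own creation utilities, but you can verify a downloaded file loads cleanly with a few lines ("Landscape.pt" below stands for whichever checkpoint you fetched):

# Sketch: load a checkpoint onto CPU and peek at its contents.
import torch

state = torch.load("Landscape.pt", map_location="cpu")
print(type(state))                 # usually a state dict or a wrapper dict
if isinstance(state, dict):
    for name in list(state)[:5]:   # first few keys (parameter names or sub-dicts)
        print(name)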

Datasets

  1. Landscape
  2. AIST++_crop

The datasets can be downloaded from Google Drive and Baidu Cloud.
We use only the training set, both for training and for evaluation.

You can also run our scripts on your own dataset by pointing them at a directory containing your videos; they will pick up every video under that path, regardless of how the files are organized.
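
For reference, "pick up every video under that path" amounts to a recursive scan like the one below (a sketch; the extension set is an assumption, and the repo's loader defines its own):

# Recursively collect video files under a dataset root, independent of layout.
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}

def collect_videos(root):
    return sorted(p for p in Path(root).rglob("*")
                  if p.suffix.lower() in VIDEO_EXTS)

videos = collect_videos("/data/landscape")   # hypothetical dataset path
print(f"found {len(videos)} videos")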

Test

  1. Download the pre-trained checkpoints.
  2. Download the datasets: Landscape or AIST++_crop.
  3. Modify the relevant paths and run the generation script to generate audio-video pairs:
bash ssh_scripts/multimodal_sample_sr.sh
  4. Modify REF_DIR, SAMPLE_DIR, OUTPUT_DIR and run the evaluation script:
bash ssh_scripts/multimodal_eval.sh
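
If you want to drive both steps from one place, a minimal Python wrapper is sketched below; it assumes you have already edited the path variables inside the two scripts, since it only sequences the calls:

# Run generation, then evaluation; stop on the first failure.
import subprocess

for script in ("ssh_scripts/multimodal_sample_sr.sh",
               "ssh_scripts/multimodal_eval.sh"):
    subprocess.run(["bash", script], check=True)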

Train

  1. Download and prepare the training datasets: Landscape or AIST++_crop.
  2. Run the training scripts:
# Training the base model
bash ssh_scripts/multimodal_train.sh

# Training the upsampler from 64x64 to 256x256; first extract the videos into frames for SR training (a sketch follows below)
bash ssh_scripts/image_sr_train.sh
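
Frame extraction for SR training can be done along these lines (a sketch; the output layout and paths are assumptions, so match whatever image_sr_train.sh expects):

# Dump each video's frames as PNGs for image SR training.
from pathlib import Path
from torchvision.io import read_video, write_png

def extract_frames(video_path, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    frames, _, _ = read_video(str(video_path), pts_unit="sec")  # (T, H, W, C) uint8
    for i, frame in enumerate(frames):
        write_png(frame.permute(2, 0, 1), str(out / f"{i:06d}.png"))

# Hypothetical paths for illustration only.
extract_frames("/data/landscape/clip0001.mp4", "/data/landscape_frames/clip0001")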

Conditional Generation

# zero-shot conditional generation: audio-to-video
bash ssh_scripts/audio2video_sample_sr.sh

# zero-shot conditional generation: video-to-audio
bash ssh_scripts/video2audio_sample.sh
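
Both scripts reuse the unconditional model zero-shot: at each denoising step the known modality is pinned to a correspondingly noised version of the real input, so only the missing modality is actually generated. The sketch below illustrates this replacement-style idea for audio-to-video with toy tensors and a stand-in denoiser; it is not this repo's exact sampler:

# Replacement-style zero-shot conditioning (toy example).
import torch

torch.manual_seed(0)
denoiser = torch.nn.Linear(96, 64)            # stand-in: video noise from [video|audio]

def noised(x0, alpha_bar):
    # Forward-diffuse a clean signal to noise level alpha_bar, i.e. q(x_t | x_0).
    return alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * torch.randn_like(x0)

audio_real = torch.randn(1, 32)               # the given audio to condition on
video = torch.randn(1, 64)                    # video starts as pure Gaussian noise
alphas = torch.linspace(0.999, 0.95, 10)      # toy schedule
alpha_bars = torch.cumprod(alphas, dim=0)

for t in range(9, -1, -1):                    # reverse process, t = T-1 .. 0
    audio_t = noised(audio_real, alpha_bars[t])   # pin audio at the matching noise level
    eps_v = denoiser(torch.cat([video, audio_t], dim=-1))
    video = (video - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_v) / alphas[t].sqrt()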

Related projects

We also sincerely recommend some other excellent works related to ours. ✨

Citation

If you find our work useful for your research, please consider citing our paper. 😊

@inproceedings{ruan2022mmdiffusion,
  author    = {Ruan, Ludan and Ma, Yiyang and Yang, Huan and He, Huiguo and Liu, Bei and Fu, Jianlong and Yuan, Nicholas Jing and Jin, Qin and Guo, Baining},
  title     = {MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation},
  booktitle = {CVPR},
  year      = {2023},
}

Contact

If you run into any problems, please open an issue or contact:
