A Survey on Video Diffusion Models
Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
(Source: Make-A-Video, SimDA, PYoCo, SVD, Video LDM and Tune-A-Video)
- [News] We are planning to update the survey soon to encompass the latest work. If you have any suggestions, please feel free to contact us.
- [News] The Chinese translation is available on Zhihu. Special thanks to Dai-Wenxun for this.
Open-source Toolboxes and Foundation Models
Methods | Task | Github |
---|---|---|
Sora | T2V Generation & Editing | - |
VideoPoet | T2V Generation & Editing | - |
Stable Video Diffusion | T2V Generation | |
NeverEnds | T2V Generation | - |
Pika | T2V Generation | - |
EMU-Video | T2V Generation | - |
GEN-2 | T2V Generation & Editing | - |
ModelScope | T2V Generation | |
ZeroScope | T2V Generation | - |
T2V Synthesis Colab | T2V Generation | |
VideoCraft | T2V Generation & Editing | |
Diffusers (T2V synthesis) | T2V Generation | - |
AnimateDiff | Personalized T2V Generation | |
Text2Video-Zero | T2V Generation | |
HotShot-XL | T2V Generation | |
Genmo | T2V Generation | - |
Fliki | T2V Generation | - |
Table of Contents
Video Generation
Data
Caption-level
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
CelebV-Text: A Large-Scale Facial Text-Video Dataset | - | CVPR, 2023 | ||
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | - | May, 2023 | ||
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation | - | - | May, 2023 | |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | - | - | Nov, 2021 | |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | - | - | ICCV, 2021 | |
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | - | - | CVPR, 2016 |
Category-level
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | - | - | Dec., 2012 | |
First Order Motion Model for Image Animation | - | - | NeurIPS, 2019 |
Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks | - | - | CVPR, 2018 |
Metrics and Benchmarks
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Towards A Better Metric for Text-to-Video Generation | - | Jan, 2024 | ||
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI | - | - | Jan, 2024 | |
VBench: Comprehensive Benchmark Suite for Video Generative Models | Nov, 2023 | |||
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | - | - | NeurIPS, 2023 | |
CVPR 2023 Text Guided Video Editing Competition | - | - | Oct., 2023 | |
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Oct., 2023 | |||
Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset | - | - | Sep., 2023 |
Text-to-Video Generation
Training-based
Training-free
Video Generation with other conditions
Pose-guided Video Generation
Motion-guided Video Generation
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling | - | - | Jan., 2024 | |
Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation | - | - | Jan., 2024 | |
Customizing Motion in Text-to-Video Diffusion Models | - | Dec., 2023 | ||
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models | Nov., 2023 | |||
Motion-Conditioned Diffusion Model for Controllable Video Synthesis | - | Apr., 2023 | ||
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory | - | - | Aug., 2023 |
Sound-guided Video Generation
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion | - | - | ICCV, 2023 | |
Generative Disco: Text-to-Video Generation for Music Visualization | - | - | Apr., 2023 | |
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion | - | - | CVPRW, 2023 |
Image-guided Video Generation
Brain-guided Video Generation
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
NeuroCine: Decoding Vivid Video Sequences from Human Brain Activities | - | - | Feb., 2024 |
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity | NeurIPS, 2023 |
Depth-guided Video Generation
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | Jul., 2023 | |||
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance | Jun., 2023 |
Multi-modal guided Video Generation
Unconditional Video Generation
U-Net based
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation | - | - | Feb. 2024 | |
Video Probabilistic Diffusion Models in Projected Latent Space | CVPR, 2023 | |||
VIDM: Video Implicit Diffusion Models | AAAI, 2023 | |||
GD-VDM: Generated Depth for better Diffusion-based Video Generation | - | Jun., 2023 | ||
LEO: Generative Latent Image Animator for Human Video Synthesis | May, 2023 |
Transformer based
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Latte: Latent Diffusion Transformer for Video Generation | Jan., 2024 | |||
VDT: An Empirical Study on Video Diffusion with Transformers | - | May, 2023 |
Video Completion
Video Enhancement and Restoration
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Towards Language-Driven Video Inpainting via Multimodal Large Language Models | Jan., 2024 | |||
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution | - | - | - | WACV, 2023 |
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution | Dec., 2023 | |||
AVID: Any-Length Video Inpainting with Diffusion Model | Dec., 2023 | |||
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution | - | CVPR, 2023 | ||
LDMVFI: Video Frame Interpolation with Latent Diffusion Models | - | - | Mar., 2023 | |
CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming | - | - | Nov., 2022 | |
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos | - | - | May, 2023 |
Video Prediction
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction | - | Dec, 2023 | ||
Video Diffusion Models with Local-Global Context Guidance | - | IJCAI, 2023 | ||
Seer: Language Instructed Video Prediction with Latent Diffusion Models | - | Mar., 2023 | ||
Diffusion Models for Video Prediction and Infilling | TMLR, 2022 | |||
MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS, 2022 | |||
Diffusion Probabilistic Modeling for Video Generation | - | Mar., 2022 | ||
Flexible Diffusion Modeling of Long Videos | May, 2022 | |||
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models | May, 2023 |
Video Editing
General Editing Model
Training-free Editing Model
One-shot Editing Model
Instruct-guided Video Editing
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis | - | Dec., 2023 | ||
Neural Video Fields Editing | Dec, 2023 | |||
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models | Nov, 2023 | |||
Consistent Video-to-Video Transfer Using Synthetic Dataset | - | - | Nov., 2023 | |
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions | - | - | May, 2023 | |
Collaborative Score Distillation for Consistent Visual Synthesis | - | - | July, 2023 |
Motion-guided Video Editing
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation | Nov, 2023 | |||
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction | - | Nov, 2023 | ||
DragVideo: Interactive Drag-style Video Editing | - | Nov, 2023 | ||
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet | - | July, 2023 |
Sound-guided Video Editing
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model | - | - | May, 2023 |
Soundini: Sound-Guided Diffusion for Natural Video Editing | Apr., 2023 |
Multi-modal Control Editing Model
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion | Dec, 2023 | |||
MagicStick: Controllable Video Editing via Control Handle Transformations | Nov, 2023 | |||
SAVE: Protagonist Diversification with Structure Agnostic Video Editing | - | Nov, 2023 | ||
MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | - | - | May, 2023 |
CCEdit: Creative and Controllable Video Editing via Diffusion Models | - | - | Sep, 2023 | |
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | May, 2023 |
Domain-specific Editing Model
Non-diffusion Editing model
Title | arXiv | Github | Website | Pub. & Date |
---|---|---|---|---|
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing | - | Oct., 2023 | ||
INVE: Interactive Neural Video Editing | - | Jul., 2023 | ||
Shape-Aware Text-Driven Layered Video Editing | - | Jan., 2023 |
Video Understanding
Contact
If you have any suggestions or find our work helpful, feel free to contact us.
Homepage: Zhen Xing
Email: [email protected]
If you find our survey useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry.
@article{vdmsurvey,
title={A Survey on Video Diffusion Models},
author={Zhen Xing and Qijun Feng and Haoran Chen and Qi Dai and Han Hu and Hang Xu and Zuxuan Wu and Yu-Gang Jiang},
journal={arXiv preprint arXiv:2310.10647},
year={2023}
}