  • Stars: 747
  • Rank: 60,741 (Top 2%)
  • Language: Python
  • License: Apache License 2.0
  • Created: about 1 year ago
  • Updated: 9 months ago


Repository Details

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Rui Zhao · Yuchao Gu · Jay Zhangjie Wu · David Junhao Zhang · Jia-Wei Liu · Weijia Wu · Jussi Keppo · Mike Zheng Shou


Show Lab, National University of Singapore   |   Zhejiang University


MotionDirector can customize text-to-video diffusion models to generate videos with desired motions.

Task Definition

Motion Customization of Text-to-Video Diffusion Models:
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate diverse videos with this motion.

Demos

Demo Video of MotionDirector

Customize both Appearance and Motion:

Reference images for appearance customization: "A Terracotta Warrior on a pure color background."
Videos generated by MotionDirector:
  • "A Terracotta Warrior is riding a horse through an ancient battlefield." (seed: 1455028)
  • "A Terracotta Warrior is playing golf in front of the Great Wall." (seed: 5804477)
  • "A Terracotta Warrior is walking cross the ancient army captured with a reverse follow cinematic shot." (seed: 653658)

Reference videos for motion customization: "A person is riding a bicycle."
Videos generated by MotionDirector:
  • "A Terracotta Warrior is riding a bicycle past an ancient Chinese palace." (seed: 166357)
  • "A Terracotta Warrior is lifting weights in front of the Great Wall." (seed: 5635982)
  • "A Terracotta Warrior is skateboarding." (seed: 9033688)

News

ToDo

  • Gradio Demo
  • More trained weights of MotionDirector

Model List

  • MotionDirector for Sports: trained on multiple videos per model. Learns motion concepts of sports, e.g., lifting weights, riding a horse, playing golf. (Link)
  • MotionDirector for Cinematic Shots: trained on a single video per model. Learns motion concepts of cinematic shots, e.g., dolly zoom, zoom in, zoom out. (Link)
  • MotionDirector for Image Animation: trained on a single image for the spatial path and one or more videos for the temporal path. Animates the given image with learned motions. (Link)
  • MotionDirector with Customized Appearance: trained on one or more images for the spatial path and one or more videos for the temporal path. Customizes both appearance and motion in video generation. (Link)

Setup

Requirements

# create virtual environment
conda create -n motiondirector python=3.8
conda activate motiondirector
# install packages
pip install -r requirements.txt

Weights of Foundation Models

git lfs install
## You can choose the ModelScopeT2V or ZeroScope, etc., as the foundation model.
## ZeroScope
git clone https://huggingface.co/cerspense/zeroscope_v2_576w ./models/zeroscope_v2_576w/
## ModelScopeT2V
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope/

Weights of trained MotionDirector

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ruizhaocv/MotionDirector_weights ./outputs

# More and better-trained MotionDirector weights are released in a new repo:
git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs
# The usage is slightly different; documentation will be updated later.

Usage

Training

Train MotionDirector on multiple videos:

python MotionDirector_train.py --config ./configs/config_multi_videos.yaml

Train MotionDirector on a single video:

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Note:

  • Before running the above commands, make sure you replace the paths to the foundation model weights and the training data with your own in the config files config_multi_videos.yaml or config_single_video.yaml (a minimal sketch follows this list).
  • Training on multiple 16-frame videos usually takes 300~500 steps, about 9~16 minutes on one A5000 GPU. Training on a single video takes 50~150 steps, about 1.5~4.5 minutes on one A5000 GPU. The required VRAM for training is around 14 GB.
  • Reduce n_sample_frames if your GPU memory is limited.
  • Reduce the learning rate and increase the training steps for better performance.
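
For concreteness, here is a minimal sketch of that workflow. The copied config name my_motion.yaml is just a placeholder used for illustration; the script and flag are the ones documented above.

# copy a template config and point it at your own data
cp ./configs/config_multi_videos.yaml ./configs/my_motion.yaml
# edit ./configs/my_motion.yaml: set the path to the foundation model weights and
# your training videos, and lower n_sample_frames if GPU memory is limited
python MotionDirector_train.py --config ./configs/my_motion.yaml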

Inference

python MotionDirector_inference.py --model /path/to/the/foundation/model  --prompt "Your prompt" --checkpoint_folder /path/to/the/trained/MotionDirector --checkpoint_index 300 --noise_prior 0.

Note:

  • Replace /path/to/the/foundation/model with your own path to the foundation model, like ZeroScope.
  • The value of checkpoint_index selects the checkpoint saved at that training step.
  • The value of noise_prior indicates how much the inversion noise of the reference video affects the generation. We recommend setting it to 0 for a MotionDirector trained on multiple videos, for the most diverse generations, and to 0.1~0.5 for a MotionDirector trained on a single video, for faster convergence and better alignment with the reference video (a small sweep sketch follows this list).
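
To see the noise_prior trade-off directly, the sketch below sweeps a few values for a single-video MotionDirector; the model path, prompt, and checkpoint folder are placeholders to replace with your own.

# sweep noise_prior: higher values stay closer to the reference video,
# 0 gives the most diverse generations
for np in 0.1 0.3 0.5; do
  python MotionDirector_inference.py --model /path/to/the/foundation/model \
    --prompt "Your prompt" \
    --checkpoint_folder /path/to/the/trained/MotionDirector \
    --checkpoint_index 150 --noise_prior $np
done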

Inference with pre-trained MotionDirector

All available weights are in the official Hugging Face repo. Run the download command above; the weights will be downloaded to the folder outputs. Then run the following inference commands to generate videos.

MotionDirector trained on multiple videos:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A person is riding a bicycle past the Eiffel Tower." --checkpoint_folder ./outputs/train/riding_bicycle/ --checkpoint_index 300 --noise_prior 0. --seed 7192280

Note:

  • Replace /path/to/the/ZeroScope with your own path to the foundation model, i.e., ZeroScope.
  • Change the prompt to generate different videos.
  • The seed is set to a random value by default. Setting it to a specific value reproduces the results provided in the table below (a small reproduction sketch follows these notes).
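
As an example, the riding-bicycle results in the table below can be regenerated by pairing each prompt with its published seed. This is a minimal sketch assuming ZeroScope is at /path/to/the/ZeroScope and the weights were downloaded to ./outputs as described above.

# reproduce the riding-bicycle results by pairing each prompt with its seed
while IFS='|' read -r prompt seed; do
  python MotionDirector_inference.py --model /path/to/the/ZeroScope \
    --prompt "$prompt" \
    --checkpoint_folder ./outputs/train/riding_bicycle/ \
    --checkpoint_index 300 --noise_prior 0. --seed "$seed"
done <<'EOF'
A person is riding a bicycle past the Eiffel Tower.|7192280
A panda is riding a bicycle in a garden.|2178639
An alien is riding a bicycle on Mars.|2390886
EOF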

Results:

Reference video: "A person is riding a bicycle."
Videos generated by MotionDirector:
  • "A person is riding a bicycle past the Eiffel Tower." (seed: 7192280)
  • "A panda is riding a bicycle in a garden." (seed: 2178639)
  • "An alien is riding a bicycle on Mars." (seed: 2390886)

MotionDirector trained on a single video:

16 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A tank is running on the moon." --checkpoint_folder ./outputs/train/car_16/ --checkpoint_index 150 --noise_prior 0.5 --seed 8551187

Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A tank is running on the moon." (seed: 8551187)
  • "A lion is running past the pyramids." (seed: 431554)
  • "A spaceship is flying past Mars." (seed: 8808231)

24 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A truck is running past the Arc de Triomphe." --checkpoint_folder ./outputs/train/car_24/ --checkpoint_index 150 --noise_prior 0.5 --width 576 --height 320 --num-frames 24 --seed 34543

Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A truck is running past the Arc de Triomphe." (seed: 34543)
  • "An elephant is running in a forest." (seed: 2171736)
  • "A person on a camel is running past the pyramids." (seed: 4904126)
  • "A spacecraft is flying past the Milky Way galaxy." (seed: 3235677)

MotionDirector for Sports

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A panda is lifting weights in a garden." --checkpoint_folder ./outputs/train/lifting_weights/ --checkpoint_index 300 --noise_prior 0. --seed 9365597

Videos generated by MotionDirector:

Lifting Weights:
  • "A panda is lifting weights in a garden." (seed: 1699276)
  • "A police officer is lifting weights in front of the police station." (seed: 6804745)

Riding Bicycle:
  • "A panda is riding a bicycle in a garden." (seed: 2178639)
  • "An alien is riding a bicycle on Mars." (seed: 2390886)

Riding Horse:
  • "A knight riding on horseback passing by a castle." (seed: 6491893)
  • "A man riding an elephant through the jungle." (seed: 6230765)
  • "A girl riding a unicorn galloping under the moonlight." (seed: 6940542)
  • "An adventurer riding a dinosaur exploring through the rainforest." (seed: 6972276)

Skateboarding:
  • "A robot is skateboarding in a cyberpunk city." (seed: 1020673)
  • "A teddy bear skateboarding in Times Square New York." (seed: 3306353)

Playing Golf:
  • "A man is playing golf in front of the White House." (seed: 8870450)
  • "A monkey is playing golf on a field full of flowers." (seed: 2989633)

More sports, to be continued ...

MotionDirector for Cinematic Shots

1. Zoom

1.1 Dolly Zoom (Hitchcockian Zoom)

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a dolly zoom." --checkpoint_folder ./outputs/train/dolly_zoom/ --checkpoint_index 150 --noise_prior 0.5 --seed 9365597

Reference video: "A man standing in room captured with a dolly zoom."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a dolly zoom." (seed: 9365597, noise_prior: 0.5)
  • "A lion sitting on top of a cliff captured with a dolly zoom." (seed: 1675932, noise_prior: 0.5)
  • "A Roman soldier standing in front of the Colosseum captured with a dolly zoom." (seed: 2310805, noise_prior: 0.5)
  • "A firefighter standing in front of a burning forest captured with a dolly zoom." (seed: 4615820, noise_prior: 0.3)
  • "A lion sitting on top of a cliff captured with a dolly zoom." (seed: 4114896, noise_prior: 0.3)
  • "A Roman soldier standing in front of the Colosseum captured with a dolly zoom." (seed: 7492004)

1.2 Zoom In

The reference video was shot with my own water cup. You can also pick up your cup, or any other object, to practice camera movements and turn them into imaginative videos. Create your own AI films with customized camera movements!

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a zoom in." --checkpoint_folder ./outputs/train/zoom_in/ --checkpoint_index 150 --noise_prior 0.3 --seed 1429227

Reference video: "A cup in a lab captured with a zoom in."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a zoom in." (seed: 1429227)
  • "A lion sitting on top of a cliff captured with a zoom in." (seed: 487239)
  • "A Roman soldier standing in front of the Colosseum captured with a zoom in." (seed: 1393184)

1.3 Zoom Out

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a zoom out." --checkpoint_folder ./outputs/train/zoom_out/ --checkpoint_index 150 --noise_prior 0.3 --seed 4971910

Reference video: "A cup in a lab captured with a zoom out."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a zoom out." (seed: 4971910)
  • "A lion sitting on top of a cliff captured with a zoom out." (seed: 1767994)
  • "A Roman soldier standing in front of the Colosseum captured with a zoom out." (seed: 8203639)

2. Advanced Cinematic Shots

Follow:
  • "A fireman is walking through fire captured with a follow cinematic shot." (seed: 4926511)
  • "A spaceman is walking on the moon with a follow cinematic shot." (seed: 7594623)

Reverse Follow:
  • "A fireman is walking through fire captured with a reverse follow cinematic shot." (seed: 9759630)
  • "A spaceman walking on the moon captured with a reverse follow cinematic shot." (seed: 4539309)

Chest Transition:
  • "A fireman is walking through the burning forest captured with a chest transition cinematic shot." (seed: 5236349)
  • "An ancient Roman soldier walks through the crowd on the street captured with a chest transition cinematic shot." (seed: 3982271)

Mini Jib Reveal (Foot-to-Head Shot):
  • "An ancient Roman soldier walks through the crowd on the street captured with a mini jib reveal cinematic shot." (seed: 654178)
  • "A British Redcoat soldier is walking through the mountains captured with a mini jib reveal cinematic shot." (seed: 566917)

Pull Back (Subject Enters from the Left):
  • "A robot looks at a distant cyberpunk city captured with a pull back cinematic shot." (seed: 9342597)
  • "A woman looks at a distant erupting volcano captured with a pull back cinematic shot." (seed: 4197508)

Orbit:
  • "A fireman in the burning forest captured with an orbit cinematic shot." (seed: 8450300)
  • "A spaceman on the moon captured with an orbit cinematic shot." (seed: 5899496)

More cinematic shots, to be continued ...

MotionDirector for Image Animation

Train

Train the spatial path with the reference image.

python MotionDirector_train.py --config ./configs/config_single_image.yaml

Then train the temporal path to learn the motion in the reference video.

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Inference

Inference with the spatial path learned from the reference image and the temporal path learned from the reference video.

python MotionDirector_inference_multi.py --model /path/to/the/foundation/model  --prompt "Your prompt" --spatial_path_folder /path/to/the/trained/MotionDirector/spatial/lora/ --temporal_path_folder /path/to/the/trained/MotionDirector/temporal/lora/ --noise_prior 0.

Example

Download the pre-trained weights.

git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs

Run the following command.

python MotionDirector_inference_multi.py --model /path/to/the/ZeroScope  --prompt "A car is running on the road." --spatial_path_folder ./outputs/train/image_animation/train_2023-12-26T14-37-16/checkpoint-300/spatial/lora/ --temporal_path_folder ./outputs/train/image_animation/train_2023-12-26T13-08-20/checkpoint-300/temporal/lora/ --noise_prior 0.5 --seed 5057764

Reference image: "A car is running on the road."
Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A car is running on the road." (seed: 5057764)
  • "A car is running on the road covered with snow." (seed: 4904543)

MotionDirector with Customized Appearance

Train

Train the spatial path with the reference images.

python MotionDirector_train.py --config ./configs/config_multi_images.yaml

Then train the temporal path to learn the motions in the reference videos.

python MotionDirector_train.py --config ./configs/config_multi_videos.yaml

Inference

Inference with the spatial path learned from the reference images and the temporal path learned from the reference videos.

python MotionDirector_inference_multi.py --model /path/to/the/foundation/model  --prompt "Your prompt" --spatial_path_folder /path/to/the/trained/MotionDirector/spatial/lora/ --temporal_path_folder /path/to/the/trained/MotionDirector/temporal/lora/ --noise_prior 0.

Example

Download the pre-trained weights.

git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs

Run the following command.

python MotionDirector_inference_multi.py --model /path/to/the/ZeroScope  --prompt "A Terracotta Warrior is riding a horse through an ancient battlefield." --spatial_path_folder ./outputs/train/customized_appearance/terracotta_warrior/checkpoint-default/spatial/lora --temporal_path_folder ./outputs/train/riding_horse/checkpoint-default/temporal/lora/ --noise_prior 0. --seed 1455028

Results are shown in the table in the Demos section above.

More results

If you have trained a more impressive MotionDirector or generated better videos, please feel free to open an issue and share them with us. We would greatly appreciate it. Improvements to the code are also highly welcome.

Please refer to the Project Page for more results.

Astronaut's daily life on Mars (motion concepts learned by MotionDirector):

  • Lifting Weights: "An astronaut is lifting weights on Mars, 4K, high quailty, highly detailed." (seed: 4008521)
  • Playing Golf: "Astronaut playing golf on Mars" (seed: 659514)
  • Riding Horse: "An astronaut is riding a horse on Mars, 4K, high quailty, highly detailed." (seed: 1913261)
  • Riding Bicycle: "An astronaut is riding a bicycle past the pyramids Mars, 4K, high quailty, highly detailed." (seed: 5532778)
  • Skateboarding: "An astronaut is skateboarding on Mars" (seed: 6615212)
  • Cinematic Shot "Reverse Follow": "An astronaut is walking on Mars captured with a reverse follow cinematic shot." (seed: 1224445)
  • Cinematic Shot "Follow": "An astronaut is walking on Mars captured with a follow cinematic shot." (seed: 6191674)
  • Cinematic Shot "Orbit": "An astronaut is standing on Mars captured with an orbit cinematic shot." (seed: 7483453)

Citation

@article{zhao2023motiondirector,
  title={MotionDirector: Motion Customization of Text-to-Video Diffusion Models},
  author={Zhao, Rui and Gu, Yuchao and Wu, Jay Zhangjie and Zhang, David Junhao and Liu, Jiawei and Wu, Weijia and Keppo, Jussi and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2310.08465},
  year={2023}
}

Shoutouts

More Repositories

1. Awesome-Video-Diffusion (3,195 stars): A curated list of recent diffusion models for video generation, editing, restoration, understanding, etc.
2. Show-1 (Python, 1,089 stars): Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
3. Tune-A-Video (Python, 1,010 stars): Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
4. Image2Paragraph (Python, 781 stars): [A toolbox for fun.] Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet.
5. Show-o (Python, 684 stars): Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
6. VideoSwap (342 stars): Code for [CVPR 2024] VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
7. Awesome-MLLM-Hallucination (340 stars): 📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM).
8. all-in-one (Python, 277 stars): [CVPR2023] All in One: Exploring Unified Video-Language Pre-training
9. BoxDiff (Python, 239 stars): [ICCV 2023] BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
10. DeVRF (Python, 179 stars): The Pytorch implementation of "DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes"
11. EgoVLP (Python, 140 stars): [NeurIPS2022] Egocentric Video-Language Pretraining
12. VisorGPT (Python, 129 stars): [NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT
13. Awesome-GUI-Agent (109 stars): 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
14. Awesome-Unified-Multimodal-Models (106 stars): 📖 A repository for organizing papers, code and other resources related to unified multimodal models.
15. ShowAnything (Jupyter Notebook, 79 stars)
16. cosmo (Python, 70 stars)
17. loveu-tgve-2023 (Python, 68 stars): Official GitHub repository for the Text-Guided Video Editing (TGVE) competition of LOVEU Workshop @ CVPR'23.
18. sparseformer (Python, 62 stars): (ICLR 2024, CVPR 2024) SparseFormer
19. datacentric.vlp (Python, 48 stars): Compress conventional Vision-Language Pre-training data
20. Region_Learner (Python, 42 stars): The Pytorch implementation for "Video-Text Pre-training with Learned Regions"
21. ShowRoom3D (24 stars): Project page of ShowRoom3D
22. Long-form-Video-Prior (Python, 22 stars)
23. DemoVLP (Python, 21 stars): [Arxiv2022] Revitalize Region Feature for Democratizing Video-Language Pre-training
24. CLVQA (Python, 19 stars): [AAAI2023 (Oral)] Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
25. BYOC (C#, 19 stars): [IEEE-VR 2024] Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters
26. Q2A (Python, 18 stars): [ECCV 2022] AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
27. HOSNeRF (JavaScript, 15 stars): Project page for HOSNeRF
28. headshot (12 stars)
29. GEB-Plus (Python, 12 stars): [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
30. LOVA3 (Python, 12 stars): [NeurIPS 2024] "Learning to Visual Question Answering, Asking and Assessment"
31. Show-Anything-3D (11 stars): Edit and Generate Anything in 3D world!
32. Awesome-Long-Context (10 stars): A curated list of resources about long-context in large-language models and video understanding.
33. SCT (Python, 10 stars): [IJCV2023] Official implementation of "SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels"
34. VisInContext (Python, 9 stars): Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
35. SOIS (8 stars): The Pytorch implementation of "Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization"
36. AVA-AVD (Python, 7 stars)
37. Efficient-CLS (6 stars): [arXiv2022] Label-Efficient Online Continual Object Detection in Streaming Video
38. videollm-online (Python, 6 stars): VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
39. Tune-An-Ellipse (6 stars): [CVPR 2024] Tune-An-Ellipse: CLIP Has Potential to Find What You Want
40. mist (5 stars)
41. ColonNeRF (JavaScript, 4 stars): Project page for ColonNeRF
42. DynVideo-E (JavaScript, 3 stars): Project page for DynVideo-E
43. VideoLISA (3 stars): [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
44. TTC-Tuning (2 stars): Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm
45. assistq (SCSS, 1 star)