
Tune-A-Video

This repository is the official implementation of Tune-A-Video.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Project Website | arXiv | Hugging Face Spaces | Open In Colab


Given a video-text pair as input, our method, Tune-A-Video, fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.

News

🚨 Announcing LOVEU-TGVE: A CVPR competition for AI-based video editing! Submissions due Jun 5. Don't miss out! 🤩

  • [02/22/2023] Improved consistency using DDIM inversion.
  • [02/08/2023] Colab demo released!
  • [02/03/2023] Pre-trained Tune-A-Video models are available on Hugging Face Library!
  • [01/28/2023] New Feature: tune a video on personalized DreamBooth models.
  • [01/28/2023] Code released!

Setup

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable it, set enable_xformers_memory_efficient_attention=True (the default).
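If xformers is not already installed, it can usually be obtained from PyPI; this is only a sketch, and the exact build may need to match your PyTorch/CUDA versions:

pip install xformers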

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4, v2-1). You can also use fine-tuned Stable Diffusion models trained on different styles (e.g., Modern Disney, Anything V4.0, Redshift, etc.).

[DreamBooth] DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3~5) images of a subject. Tuning a video on DreamBooth models allows personalized text-to-video generation of a specific subject. There are some public DreamBooth models available on Hugging Face (e.g., mr-potato-head). You can also train your own DreamBooth model following this training example.
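For reference, a minimal sketch of fetching these weights programmatically with the huggingface_hub package (the repo ids shown are illustrative examples; any Stable Diffusion or DreamBooth checkpoint on the Hub can be fetched the same way):

from huggingface_hub import snapshot_download

# Download Stable Diffusion v1-4; the returned local path can be used as pretrained_model_path
# (alternatively, clone the repo into ./checkpoints/stable-diffusion-v1-4 as in the examples below).
sd_path = snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")

# A public DreamBooth base model can be downloaded the same way.
dreambooth_path = snapshot_download(repo_id="sd-dreambooth-library/mr-potato-head")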

Usage

Training

To fine-tune a pre-trained text-to-image diffusion model for text-to-video generation, run this command:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Note: Tuning a 24-frame video usually takes 300~500 steps, about 10~15 minutes using one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
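Training is driven entirely by the YAML config passed above. The sketch below is illustrative only: the values mirror the man-skiing example used throughout this README, but the key names and the video path are assumptions that should be checked against the shipped configs/man-skiing.yaml:

pretrained_model_path: "./checkpoints/stable-diffusion-v1-4"
output_dir: "./outputs/man-skiing"
train_data:
  video_path: "data/man-skiing.mp4"   # hypothetical path to the input video
  prompt: "a man is skiing"
  n_sample_frames: 24                 # reduce if GPU memory is limited
  width: 512
  height: 512
validation_data:
  prompts:
    - "spider man is skiing"
  video_length: 24
  num_inference_steps: 50
  guidance_scale: 12.5
max_train_steps: 500
enable_xformers_memory_efficient_attention: True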

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

# Base text-to-image weights and the output directory of the fine-tuning run above.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/man-skiing"

# Load the fine-tuned 3D UNet and build the text-to-video pipeline around it.
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()

# DDIM-inverted latents of the source video are used as initial noise for better temporal consistency.
prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=12.5).videos

save_videos_grid(video, f"./{prompt}.gif")
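The same pipeline runs on top of a personalized DreamBooth base: point pretrained_model_path at the DreamBooth checkpoint instead of vanilla Stable Diffusion. A minimal sketch, where the paths, output directory, and prompt are illustrative and the DDIM-inverted latents are omitted for brevity (they can be passed exactly as above):

# Illustrative paths: a DreamBooth base model and a Tune-A-Video run fine-tuned on top of it.
pretrained_model_path = "./checkpoints/mr-potato-head"
my_model_path = "./outputs/bear-playing-guitar"

unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_vae_slicing()

video = pipe("Mr Potato Head, wearing sunglasses, is playing guitar on the beach",
             video_length=24, height=512, width=512,
             num_inference_steps=50, guidance_scale=12.5).videos
save_videos_grid(video, "./mr-potato-head.gif")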

Results

Pretrained T2I (Stable Diffusion)

  • Input: "A man is skiing" → Outputs: "Spider Man is skiing on the beach, cartoon style" | "Wonder Woman, wearing a cowboy hat, is skiing" | "A man, wearing pink clothes, is skiing at sunset"
  • Input: "A rabbit is eating a watermelon on the table" → Outputs: "A rabbit is eating a watermelon on the table" | "A cat with sunglasses is eating a watermelon on the beach" | "A puppy is eating a cheeseburger on the table, comic style"
  • Input: "A jeep car is moving on the road" → Outputs: "A Porsche car is moving on the beach" | "A car is moving on the road, cartoon style" | "A car is moving on the snow"
  • Input: "A man is dribbling a basketball" → Outputs: "James Bond is dribbling a basketball on the beach" | "An astronaut is dribbling a basketball, cartoon style" | "A lego man in a black suit is dribbling a basketball"

Pretrained T2I (personalized DreamBooth)

  • Input: "A bear is playing guitar" → Outputs: "1girl is playing guitar, white hair, medium hair, cat ears, closed eyes, cute, scarf, jacket, outdoors, streets" | "1boy is playing guitar, bishounen, casual, indoors, sitting, coffee shop, bokeh" | "1girl is playing guitar, red hair, long hair, beautiful eyes, looking at viewer, cute, dress, beach, sea"
  • Input: "A bear is playing guitar" → Outputs: "A rabbit is playing guitar, modern disney style" | "A handsome prince is playing guitar, modern disney style" | "A magic princess with sunglasses is playing guitar on the stage, modern disney style"
  • Input: "A bear is playing guitar" → Outputs: "Mr Potato Head, made of lego, is playing guitar on the snow" | "Mr Potato Head, wearing sunglasses, is playing guitar on the beach" | "Mr Potato Head is playing guitar in the starry night, Van Gogh style"

Citation

If you make use of our work, please cite our paper.

@article{wu2022tuneavideo,
    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},
    year={2022}
}

Shoutouts

More Repositories

  1. Awesome-Video-Diffusion - A curated list of recent diffusion models for video generation, editing, restoration, understanding, etc. (2,725 stars)
  2. Show-1 - Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (Python, 1,073 stars)
  3. Image2Paragraph - [A toolbox for fun.] Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet. (Python, 781 stars)
  4. MotionDirector - MotionDirector: Motion Customization of Text-to-Video Diffusion Models. (Python, 747 stars)
  5. Show-o - Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation. (Python, 684 stars)
  6. Awesome-MLLM-Hallucination - 📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM). (340 stars)
  7. VideoSwap - Code for [CVPR 2024] VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence (318 stars)
  8. all-in-one - [CVPR2023] All in One: Exploring Unified Video-Language Pre-training (Python, 277 stars)
  9. BoxDiff - [ICCV 2023] BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion (Python, 239 stars)
  10. DeVRF - The Pytorch implementation of "DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes" (Python, 178 stars)
  11. EgoVLP - [NeurIPS2022] Egocentric Video-Language Pretraining (Python, 140 stars)
  12. VisorGPT - [NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT (Python, 129 stars)
  13. Awesome-GUI-Agent - 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents. (109 stars)
  14. Awesome-Unified-Multimodal-Models - 📖 A repository for organizing papers, codes and other resources related to unified multimodal models. (106 stars)
  15. ShowAnything (Jupyter Notebook, 79 stars)
  16. cosmo (Python, 70 stars)
  17. loveu-tgve-2023 - Official GitHub repository for the Text-Guided Video Editing (TGVE) competition of LOVEU Workshop @ CVPR'23. (Python, 68 stars)
  18. sparseformer - (ICLR 2024, CVPR 2024) SparseFormer (Python, 62 stars)
  19. datacentric.vlp - Compress conventional Vision-Language Pre-training data (Python, 48 stars)
  20. Region_Learner - The Pytorch implementation for "Video-Text Pre-training with Learned Regions" (Python, 42 stars)
  21. ShowRoom3D - Project page for ShowRoom3D (24 stars)
  22. Long-form-Video-Prior (Python, 22 stars)
  23. DemoVLP - [Arxiv2022] Revitalize Region Feature for Democratizing Video-Language Pre-training (Python, 21 stars)
  24. CLVQA - [AAAI2023 (Oral)] Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task (Python, 19 stars)
  25. BYOC - [IEEE-VR 2024] Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters (C#, 19 stars)
  26. Q2A - [ECCV 2022] AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant (Python, 18 stars)
  27. HOSNeRF - Project page for HOSNeRF (JavaScript, 15 stars)
  28. headshot (12 stars)
  29. GEB-Plus - [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval (Python, 12 stars)
  30. Show-Anything-3D - Edit and Generate Anything in 3D world! (11 stars)
  31. Awesome-Long-Context - A curated list of resources about long-context in large-language models and video understanding. (10 stars)
  32. SCT - [IJCV2023] Official implementation of "SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels" (Python, 10 stars)
  33. VisInContext - Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning (Python, 9 stars)
  34. LOVA3 - The official repo of "Learning to Visual Question Answering, Asking and Assessment" (Python, 9 stars)
  35. SOIS - The Pytorch implementation of "Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization" (8 stars)
  36. AVA-AVD (Python, 7 stars)
  37. Efficient-CLS - [arXiv2022] Label-Efficient Online Continual Object Detection in Streaming Video (6 stars)
  38. videollm-online - VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024) (Python, 6 stars)
  39. Tune-An-Ellipse - [CVPR 2024] Tune-An-Ellipse: CLIP Has Potential to Find What You Want (6 stars)
  40. mist (5 stars)
  41. ColonNeRF - Project page for ColonNeRF (JavaScript, 4 stars)
  42. DynVideo-E - Project page for DynVideo-E (JavaScript, 3 stars)
  43. TTC-Tuning - Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm (2 stars)
  44. assistq (SCSS, 1 star)