SVD_Xtend
Stable Video Diffusion Training Code
Comparison
size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
| Init Image | Before Fine-tuning | After Fine-tuning |
| --- | --- | --- |
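For reference, clips like these can be generated with the standard `diffusers` image-to-video pipeline using the settings listed above. The sketch below is only illustrative: the checkpoint path, init-image path, output filename, and `decode_chunk_size` are placeholder assumptions, not values taken from this repository.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load SVD weights (base or fine-tuned); the path is a placeholder.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "/path/to/weight", torch_dtype=torch.float16
).to("cuda")

# Init image resized to the comparison resolution (width=512, height=320).
image = load_image("/path/to/init_image.png").resize((512, 320))

# Same sampling settings as in the comparison above.
frames = pipe(
    image,
    height=320,
    width=512,
    motion_bucket_id=127,
    fps=7,
    noise_aug_strength=0.0,
    decode_chunk_size=8,  # assumption: lower this if you run out of VRAM
    generator=torch.manual_seed(111),
).frames[0]

export_to_video(frames, "comparison.mp4", fps=7)
```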
Video Data Processing
Note that BDD100K is a driving video/image dataset, but it is not required for training; any video collection can be used. Please refer to the `DummyDataset` data-reading logic. In short, you only need to modify `self.base_folder`, then arrange your videos in the following file structure:
self.base_folder
├── video_name1
│   ├── video_frame1
│   ├── video_frame2
│   ...
├── video_name2
│   ├── video_frame1
└── ...
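As a rough illustration of this layout (not the exact `DummyDataset` code in the repository), a dataset class only needs to list the per-video frame folders under `self.base_folder` and return a fixed-length clip of frames. The clip length, resolution, random-window sampling, and the `pixel_values` key below are illustrative assumptions.

```python
import os
import random

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class DummyDataset(Dataset):
    """Reads clips of consecutive frames from base_folder/<video_name>/<video_frame>."""

    def __init__(self, base_folder, num_frames=14, width=512, height=320):
        self.base_folder = base_folder  # <- point this at your own data
        self.num_frames = num_frames
        self.videos = sorted(os.listdir(base_folder))
        self.transform = transforms.Compose([
            transforms.Resize((height, width)),
            transforms.ToTensor(),                               # [0, 1]
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # [-1, 1]
        ])

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        video_dir = os.path.join(self.base_folder, self.videos[idx])
        frame_names = sorted(os.listdir(video_dir))
        # Pick a random window of num_frames consecutive frames.
        start = random.randint(0, max(len(frame_names) - self.num_frames, 0))
        clip = frame_names[start:start + self.num_frames]
        frames = [
            self.transform(Image.open(os.path.join(video_dir, name)).convert("RGB"))
            for name in clip
        ]
        return {"pixel_values": torch.stack(frames)}  # (num_frames, 3, H, W)
```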
Training Configuration (on the BDD100K dataset)
This training configuration is for reference only. I set all UNet parameters to be trainable during training and used a learning rate of 1e-5.
accelerate launch train_svd.py \
--pretrained_model_name_or_path=/path/to/weight \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=50000 \
--width=512 \
--height=320 \
--checkpointing_steps=1000 --checkpoints_total_limit=1 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=200
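After training, the fine-tuned UNet can be swapped back into the standard pipeline for comparisons like the one above. This is a hedged sketch that assumes the run saved the UNet in diffusers format under an `output_dir/unet` subfolder; adjust the paths to match your actual checkpoint layout.

```python
import torch
from diffusers import StableVideoDiffusionPipeline, UNetSpatioTemporalConditionModel

# Load the fine-tuned UNet (the path/subfolder is an assumption about the save layout).
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "/path/to/output_dir", subfolder="unet", torch_dtype=torch.float16
)

# Rebuild the pipeline around the original weights, replacing only the UNet.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "/path/to/weight", unet=unet, torch_dtype=torch.float16
).to("cuda")
```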
Disclaimer
While the codebase is functional and seems to improve video generation, it's important to note that there are still some uncertainties regarding the finer details of its implementation.
TODO List
- Support text2video (WIP)
- Support more conditional inputs, such as layout
Contribution
Feel free to fork this repository, submit pull requests, or open issues to discuss potential changes or report bugs. With your valuable input, we can continuously improve SVD_Xtend for the community.