  • Language
    Python
  • License
    Apache License 2.0

VideoSys: An easy and efficient system for video generation

OpenDiT

An Easy, Fast and Memory-Efficient System for DiT Training and Inference

[Homepage] | [Discord] | [WeChat] | [Twitter] | [Zhihu] | [Media]

Latest News 🔥

  • [2024/03/20] Proposed Dynamic Sequence Parallel (DSP) [paper][doc], which achieves a 3x training speedup and a 2x inference speedup in OpenSora compared with state-of-the-art sequence parallelism.
  • [2024/03/18] Added support for OpenSora: Democratizing Efficient Video Production for All.
  • [2024/02/27] Officially released OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.

About

OpenDiT is an open-source project that provides a high-performance implementation of Diffusion Transformer (DiT) powered by Colossal-AI, specifically designed to enhance the efficiency of training and inference for DiT applications, including text-to-video generation and text-to-image generation.

OpenDiT has been adopted by OpenSora and MiniSora.

OpenDiT improves performance through the following techniques:

  1. Up to 80% speedup and 50% memory reduction on GPU
    • Kernel optimizations including FlashAttention, fused AdaLN, and a fused layernorm kernel.
    • Hybrid parallelism methods including ZeRO, Gemini, and DDP. Sharding the EMA model further reduces the memory cost.
  2. FastSeq: A novel sequence parallelism method
    • Specially designed for DiT-like workloads where the activation size is large but the parameter size is small.
    • Up to 48% less communication for intra-node sequence parallelism.
    • Breaks the memory limitation of a single GPU and reduces the overall training and inference time.
  3. Ease of use
    • Large performance gains with only a few lines of code changed.
    • Users do not need to know the implementation of distributed training.
  4. Complete pipeline of text-to-image and text-to-video generation
    • Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
    • The accuracy of OpenDiT has been verified with text-to-image training on ImageNet, and the checkpoint has been released.

[Figure: end2end]

Authors: Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, Yang You

OpenDiT will continue to integrate more open-source DiT models. Stay tuned for upcoming enhancements and additional features!

Installation

Prerequisites:

  • Python >= 3.10
  • PyTorch >= 1.13 (we recommend a version newer than 2.0)
  • CUDA >= 11.6
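
As a quick sanity check, the minimums above can be compared against the current environment. The following is a minimal sketch and not part of OpenDiT: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the script only compares version numbers (it assumes that `torch.version.cuda` reports the CUDA version PyTorch was built with).

```python
import platform
import re


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into a tuple like (2, 1, 0)."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(found, minimum):
    """Return True if the version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Report whether Python, PyTorch, and CUDA satisfy the minimums listed above."""
    report = {"python": meets_minimum(platform.python_version(), "3.10")}
    try:
        import torch
    except ImportError:
        # Without PyTorch neither the torch nor the CUDA requirement can be met.
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "1.13")
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    cuda = torch.version.cuda
    report["cuda"] = cuda is not None and meets_minimum(cuda, "11.6")
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Running the script before installing PyTorch reports `torch` and `cuda` as missing, which is expected at that stage.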

We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:

conda create -n opendit python=3.10 -y
conda activate opendit

Install ColossalAI:

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
git checkout adae123df3badfb15d044bd416f0cf29f250bc86
pip install -e .

Install OpenDiT:

git clone https://github.com/oahzxl/OpenDiT
cd OpenDiT
pip install -e .

(Optional but recommended) Install libraries for training & inference speed up (you can run our code without these libraries):

# Install Triton for fused adaln kernel
pip install triton

# Install FlashAttention
pip install flash-attn

# Install apex for fused layernorm kernel
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
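
Because these libraries are optional, it can be useful to see which of them the current environment actually provides. The sketch below is an illustration, not OpenDiT code: `find_optional_packages` is a hypothetical helper, and it only checks that the packages `triton`, `flash_attn`, and `apex` are discoverable by Python. It does not verify that apex was built with its CUDA and C++ extensions.

```python
import importlib.util

# Import names of the optional packages installed above and what each is used for.
OPTIONAL_PACKAGES = {
    "triton": "fused AdaLN kernel",
    "flash_attn": "FlashAttention",
    "apex": "fused layernorm kernel",
}


def find_optional_packages(packages=OPTIONAL_PACKAGES):
    """Map each import name to True if it can be found in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in find_optional_packages().items():
        status = "found" if found else "not installed"
        print(f"{name} ({OPTIONAL_PACKAGES[name]}): {status}")
```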

Usage

Here are our supported models and their usage:

Model     Source                                   Function        Usage  Optimize
DiT       https://github.com/facebookresearch/DiT  label-to-image  Usage  ✅
OpenSora  https://github.com/hpcaitech/Open-Sora   text-to-video   Usage  ✅

Technique Overview

DSP [paper][doc]

[Figure: DSP overview]

DSP (Dynamic Sequence Parallelism) is a novel, elegant, and highly efficient sequence parallelism method for OpenSora, Latte, and other multi-dimensional transformer architectures.

It achieves a 3x training speedup and a 2x inference speedup in OpenSora compared with state-of-the-art sequence parallelism (DeepSpeed Ulysses). For a 10s (80 frames) 512x512 video, the inference latency of OpenSora is:

Method       1xH800  8xH800 (DS Ulysses)  8xH800 (DSP)
Latency (s)  106     45                   22

See its details and usage here.
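
The inference latencies reported for DSP can be turned into speedup factors with a few divisions. The short sketch below does only that arithmetic. The dictionary values are the three latencies given for OpenSora on H800 GPUs, while the `speedup` helper is a hypothetical name introduced for illustration.

```python
# Inference latencies in seconds, as reported above for a 10s 512x512 video.
latency = {
    "1xH800": 106,
    "8xH800 (DS Ulysses)": 45,
    "8xH800 (DSP)": 22,
}


def speedup(baseline, method):
    """How many times faster `method` is than `baseline`."""
    return latency[baseline] / latency[method]


dsp_vs_ulysses = speedup("8xH800 (DS Ulysses)", "8xH800 (DSP)")
dsp_vs_single = speedup("1xH800", "8xH800 (DSP)")
ulysses_vs_single = speedup("1xH800", "8xH800 (DS Ulysses)")

print(f"DSP vs DS Ulysses on 8 GPUs: {dsp_vs_ulysses:.2f}x")     # 2.05x
print(f"DSP vs a single GPU:         {dsp_vs_single:.2f}x")      # 4.82x
print(f"DS Ulysses vs a single GPU:  {ulysses_vs_single:.2f}x")  # 2.36x
```

The first ratio is where the roughly 2x inference speedup over DeepSpeed Ulysses comes from.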


FastSeq [doc]

[Figure: FastSeq overview]

FastSeq is a novel sequence parallelism method for large sequences and small-scale parallelism.

It focuses on minimizing sequence communication by employing only two communication operators for every transformer layer, and we use an async ring to overlap AllGather communication with QKV computation. See its details and usage here.
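
To make the ring idea more concrete, here is a minimal single-process sketch of a ring all-gather. It is not FastSeq's implementation: the function `ring_all_gather` is hypothetical, the "ranks" are plain list indices rather than GPUs, and the sketch shows only how chunks travel around the ring, not the overlap with computation.

```python
def ring_all_gather(chunks):
    """Simulate a ring all-gather among len(chunks) ranks in a single process.

    chunks[r] is the data initially owned by rank r. Returns one list per rank
    holding every rank's chunk, ordered by owner.
    """
    world_size = len(chunks)
    # gathered[r][owner] is rank r's copy of owner's chunk (None until received).
    gathered = [[None] * world_size for _ in range(world_size)]
    for rank in range(world_size):
        gathered[rank][rank] = chunks[rank]

    # After world_size - 1 steps every chunk has visited every rank.
    for step in range(world_size - 1):
        # Each rank forwards the chunk that originated `step` ranks to its left
        # (its own chunk when step == 0) to its right neighbour.
        messages = []
        for rank in range(world_size):
            owner = (rank - step) % world_size
            messages.append(((rank + 1) % world_size, owner, gathered[rank][owner]))
        # Deliver only after all sends are collected, so the step behaves like
        # a simultaneous exchange between neighbours.
        for dest, owner, data in messages:
            gathered[dest][owner] = data
    return gathered


if __name__ == "__main__":
    for rank, row in enumerate(ring_all_gather(["s0", "s1", "s2", "s3"])):
        print(f"rank {rank}: {row}")
```

In a distributed setting each step's send would typically be issued asynchronously, so a rank can start computing on the chunks it already holds while the next chunk is in flight. That is the general pattern behind overlapping communication with computation.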

DiT Reproduction Result

To verify the accuracy of OpenDiT, we trained DiT with the original method, from scratch on ImageNet for 80k steps on 8xA100. Here are some results generated by our trained DiT:

[Figure: Results]

Our loss also aligns with the results listed in the paper:

[Figure: Loss]

To reproduce our results, you can follow our instructions.

Acknowledgement

We extend our gratitude to Zangwei Zheng for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge Shenggan Cheng for his guidance on code optimization and parallelism. Our appreciation also goes to Fuzhao Xue, Shizun Wang, Yuchao Gu, Shenggui Li, and Haofan Wang for their invaluable advice and contributions.

This codebase borrows from:

  • OpenSora: Democratizing Efficient Video Production for All.
  • DiT: Scalable Diffusion Models with Transformers.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.

Contributing

If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.

Citation

@misc{zhao2024opendit,
  author = {Xuanlei Zhao and Zhongkai Zhao and Ziming Liu and Haotian Zhou and Qianli Ma and Yang You},
  title = {OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NUS-HPC-AI-Lab/OpenDiT}},
}
@misc{zhao2024dsp,
      title={DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers},
      author={Xuanlei Zhao and Shenggan Cheng and Zangwei Zheng and Zheming Yang and Ziming Liu and Yang You},
      year={2024},
      eprint={2403.10266},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
