An Easy, Fast and Memory-Efficient System for DiT Training and Inference
- [2024/03/20] Proposed Dynamic Sequence Parallelism (DSP) [paper][doc], which achieves a 3x training speedup and a 2x inference speedup in OpenSora compared with state-of-the-art (SOTA) sequence parallelism.
- [2024/03/18] Added support for OpenSora: Democratizing Efficient Video Production for All.
- [2024/02/27] Officially released OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.
OpenDiT is an open-source project that provides a high-performance implementation of Diffusion Transformer (DiT) powered by Colossal-AI, specifically designed to enhance the efficiency of training and inference for DiT applications, including text-to-video generation and text-to-image generation.
OpenDiT has been adopted by OpenSora and MiniSora.
OpenDiT achieves its performance through the following techniques:
- Up to 80% speedup and 50% memory reduction on GPU
  - Kernel optimizations including FlashAttention, fused AdaLN, and fused LayerNorm kernels.
  - Hybrid parallelism methods including ZeRO, Gemini, and DDP. Sharding the EMA model reduces the memory cost further.
- FastSeq: a novel sequence parallelism method
  - Specially designed for DiT-like workloads, where the activation size is large but the parameter size is small.
  - Up to 48% communication savings for intra-node sequence parallelism.
  - Breaks the memory limit of a single GPU and reduces overall training and inference time.
- Ease of use
  - Large performance gains with only a few lines of code changed.
  - Users do not need to know how distributed training is implemented.
- Complete pipeline for text-to-image and text-to-video generation
  - Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
  - We verified the accuracy of OpenDiT with text-to-image training on ImageNet and released the checkpoint.
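To illustrate why sharding the EMA model saves memory, here is a minimal single-process sketch. It is not OpenDiT's implementation: the helpers `shard_bounds` and `update_ema_shard` are hypothetical names, parameters are modeled as a flat Python list, and the contiguous-slice layout is an assumption made for illustration. Each rank stores and updates only its own slice of the EMA weights, so the EMA footprint per GPU shrinks roughly by the number of ranks.

```python
def shard_bounds(num_params, world_size, rank):
    """Return the contiguous [start, end) slice of a flat parameter list owned by `rank`."""
    per_rank = -(-num_params // world_size)  # ceiling division
    start = min(rank * per_rank, num_params)
    return start, min(start + per_rank, num_params)


def update_ema_shard(ema_shard, params, rank, world_size, decay=0.9999):
    """Apply ema = decay * ema + (1 - decay) * param to the slice owned by `rank` only."""
    start, end = shard_bounds(len(params), world_size, rank)
    for i, p in enumerate(params[start:end]):
        ema_shard[i] = decay * ema_shard[i] + (1.0 - decay) * p
    return ema_shard


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy "model" with five scalar parameters
    world_size = 2
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        ema_shard = [0.0] * (end - start)  # each rank allocates only its own slice
        update_ema_shard(ema_shard, params, rank, world_size, decay=0.5)
        print(f"rank {rank} holds EMA for params[{start}:{end}]: {ema_shard}")
```

In a real multi-GPU run the full EMA weights would only need to be gathered when saving a checkpoint or sampling.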
Authors: Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, Yang You
OpenDiT will continue to integrate more open-source DiT models. Stay tuned for upcoming enhancements and additional features!
Prerequisites:
- Python >= 3.10
- PyTorch >= 1.13 (we recommend a version newer than 2.0)
- CUDA >= 11.6
We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:
conda create -n opendit python=3.10 -y
conda activate opendit
Install ColossalAI:
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
git checkout adae123df3badfb15d044bd416f0cf29f250bc86
pip install -e .
Install OpenDiT:
git clone https://github.com/oahzxl/OpenDiT
cd OpenDiT
pip install -e .
(Optional but recommended) Install libraries that speed up training and inference (our code also runs without them):
# Install Triton for fused adaln kernel
pip install triton
# Install FlashAttention
pip install flash-attn
# Install apex for fused layernorm kernel
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
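Because the optional libraries are not required, it can help to confirm which of them are importable before launching a run. The following is a minimal sketch and not part of OpenDiT: the helper names `python_ok` and `optional_status` are hypothetical, and `triton`, `flash_attn`, and `apex` are the module names these packages normally expose on import.

```python
import importlib.util
import sys

# Optional accelerators from the install steps, mapped to their import names.
OPTIONAL_MODULES = {
    "Triton (fused AdaLN kernel)": "triton",
    "FlashAttention": "flash_attn",
    "apex (fused LayerNorm kernel)": "apex",
}


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the interpreter meets the Python >= 3.10 prerequisite."""
    return tuple(version_info[:2]) >= minimum


def optional_status(modules=OPTIONAL_MODULES):
    """Map each optional component to True if its top-level module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    for name, found in optional_status().items():
        print(f"{name}: {'found' if found else 'not installed'}")
```

A module being found only means it is installed; it does not guarantee that the compiled kernels match your CUDA and PyTorch versions.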
Here are our supported models and their usage:
Model | Source | Function | Usage | Optimize |
---|---|---|---|---|
DiT | https://github.com/facebookresearch/DiT | label-to-image | Usage | β |
OpenSora | https://github.com/hpcaitech/Open-Sora | text-to-video | Usage | β |
DSP (Dynamic Sequence Parallelism) is a novel, elegant, and highly efficient sequence parallelism method for OpenSora, Latte, and other multi-dimensional transformer architectures.
It achieves a 3x training speedup and a 2x inference speedup in OpenSora compared with state-of-the-art sequence parallelism (DeepSpeed Ulysses). For a 10 s (80-frame) 512x512 video, the inference latency of OpenSora is:
Method | 1xH800 | 8xH800 (DS Ulysses) | 8xH800 (DSP) |
---|---|---|---|
Latency(s) | 106 | 45 | 22 |
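The speedups implied by the table can be worked out directly. The short sketch below uses only the three latencies listed above; the `speedup` helper and the scaling-efficiency figure (speedup divided by the 8 GPUs) are added here for illustration.

```python
# Inference latencies in seconds, copied from the table above.
latency_s = {
    "1xH800": 106,
    "8xH800 (DS Ulysses)": 45,
    "8xH800 (DSP)": 22,
}


def speedup(baseline, method, table=latency_s):
    """Return how many times faster `method` is than `baseline`."""
    return table[baseline] / table[method]


dsp_vs_ulysses = speedup("8xH800 (DS Ulysses)", "8xH800 (DSP)")
dsp_vs_single = speedup("1xH800", "8xH800 (DSP)")
ulysses_vs_single = speedup("1xH800", "8xH800 (DS Ulysses)")

print(f"DSP vs DS Ulysses on 8 GPUs: {dsp_vs_ulysses:.2f}x")    # 2.05x
print(f"DSP vs a single GPU:         {dsp_vs_single:.2f}x")     # 4.82x
print(f"DS Ulysses vs a single GPU:  {ulysses_vs_single:.2f}x") # 2.36x
print(f"DSP scaling efficiency on 8 GPUs: {dsp_vs_single / 8:.0%}")  # 60%
```

The roughly 2x gap between the two 8-GPU columns matches the inference speedup quoted above.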
See its details and usage here.
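As a rough intuition for what sequence parallelism over a multi-dimensional input involves, consider a video tensor with a temporal axis and a spatial axis. The single-process sketch below is an illustration under stated assumptions, not the DSP implementation: it assumes that the sharded axis is switched between computation stages with an all-to-all exchange, it models the tensor as nested Python lists, and all function names are hypothetical. While the spatial axis is sharded, every rank sees all frames of its tokens; after the exchange, every rank sees all tokens of its frames.

```python
def shard_spatial(x, n):
    """Rank r keeps every frame but only its slice of spatial tokens."""
    s = len(x[0]) // n  # assumes the spatial length is divisible by n
    return [[row[r * s:(r + 1) * s] for row in x] for r in range(n)]


def shard_temporal(x, n):
    """Rank r keeps only its slice of frames, each with all spatial tokens."""
    t = len(x) // n  # assumes the number of frames is divisible by n
    return [x[r * t:(r + 1) * t] for r in range(n)]


def all_to_all_spatial_to_temporal(shards):
    """Simulate an all-to-all that re-shards from the spatial axis to the temporal axis."""
    n = len(shards)
    t = len(shards[0]) // n  # frames each rank owns after the exchange
    out = []
    for dst in range(n):
        rows = []
        for i in range(dst * t, (dst + 1) * t):
            full_row = []
            for src in range(n):
                # dst receives frame i's spatial slice from every source rank
                full_row.extend(shards[src][i])
            rows.append(full_row)
        out.append(rows)
    return out


if __name__ == "__main__":
    # Toy "video": 4 frames x 4 spatial tokens; each value encodes (frame, token).
    video = [[10 * t + s for s in range(4)] for t in range(4)]
    spatial = shard_spatial(video, 2)
    temporal = all_to_all_spatial_to_temporal(spatial)
    print("match:", temporal == shard_temporal(video, 2))
```

In a real multi-GPU setting the exchange would be a collective such as an all-to-all over the process group rather than list slicing.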
FastSeq [doc]
FastSeq is a novel sequence parallelism method for workloads with large sequences and small-scale parallelism.
It minimizes sequence communication by using only two communication operators per transformer layer, and it uses an asynchronous ring to overlap AllGather communication with QKV computation. See its details and usage here.
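The idea of overlapping a ring AllGather with computation can be sketched in a single process. This is a simulation under stated assumptions, not FastSeq's code: ranks are modeled as list indices, `ring_allgather_with_compute` and `toy_projection` are hypothetical names, and the "overlap" is only structural here (each step's send is posted before the compute on the chunk already in hand), whereas a real implementation would use non-blocking communication on the GPU.

```python
def ring_allgather_with_compute(chunks, compute):
    """Simulate a ring AllGather in which each rank processes chunks as they arrive.

    chunks[r] is the sequence chunk initially held by rank r.
    Returns results where results[r][i] == compute(chunks[i]) for every rank r.
    """
    n = len(chunks)
    results = [[None] * n for _ in range(n)]
    in_hand = list(chunks)   # chunk each rank currently holds
    owner = list(range(n))   # original owner of that chunk
    for step in range(n):
        last_step = step == n - 1
        if not last_step:
            # Communication: every rank passes its current chunk to the next rank.
            incoming = [in_hand[(r - 1) % n] for r in range(n)]
            incoming_owner = [owner[(r - 1) % n] for r in range(n)]
        # Computation: work on the chunk in hand while the next one is in flight.
        for r in range(n):
            results[r][owner[r]] = compute(in_hand[r])
        if not last_step:
            in_hand, owner = incoming, incoming_owner
    return results


def toy_projection(chunk):
    """Stand-in for a QKV projection: scale every token by 2."""
    return [2 * v for v in chunk]


if __name__ == "__main__":
    chunks = [[1, 2], [3, 4], [5, 6]]  # three ranks, two tokens each
    for rank, row in enumerate(ring_allgather_with_compute(chunks, toy_projection)):
        print(f"rank {rank}: {row}")
```

With `n` ranks the ring needs `n - 1` transfer steps, and every transfer can be hidden behind the projection of the chunk that has already arrived.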
To verify our accuracy, we trained DiT with OpenDiT using the original method. We trained the model from scratch on ImageNet for 80k steps on 8xA100. Here are some results generated by our trained DiT:
Our loss also aligns with the results listed in the paper:
To reproduce our results, you can follow our instructions.
We extend our gratitude to Zangwei Zheng for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge Shenggan Cheng for his guidance on code optimization and parallelism. Our appreciation also goes to Fuzhao Xue, Shizun Wang, Yuchao Gu, Shenggui Li, and Haofan Wang for their invaluable advice and contributions.
This codebase borrows from:
- OpenSora: Democratizing Efficient Video Production for All.
- DiT: Scalable Diffusion Models with Transformers.
- PixArt: An open-source DiT-based text-to-image model.
- Latte: An attempt to efficiently train DiT for video.
If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.
@misc{zhao2024opendit,
author = {Xuanlei Zhao and Zhongkai Zhao and Ziming Liu and Haotian Zhou and Qianli Ma and Yang You},
title = {OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/NUS-HPC-AI-Lab/OpenDiT}},
}
@misc{zhao2024dsp,
title={DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers},
author={Xuanlei Zhao and Shenggan Cheng and Zangwei Zheng and Zheming Yang and Ziming Liu and Yang You},
year={2024},
eprint={2403.10266},
archivePrefix={arXiv},
primaryClass={cs.DC}
}