
UniDiffuser

Code and models for the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion"


UniDiffuser is a unified diffusion framework that fits all distributions relevant to a set of multi-modal data in one model. Its key insight is that learning diffusion models for marginal, conditional, and joint distributions can all be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e., timesteps) can differ across modalities. Building on this unified view, UniDiffuser learns all distributions simultaneously with minimal modifications to the original diffusion model: it perturbs data in all modalities instead of a single one, takes individual timesteps for the different modalities as input, and predicts the noise of all modalities rather than a single one. UniDiffuser is parameterized by a transformer tailored for diffusion models, which handles inputs of different modalities. Trained on large-scale paired image-text data, UniDiffuser performs image, text, text-to-image, image-to-text, and image-text pair generation simply by setting the timesteps appropriately, without additional overhead. In particular, UniDiffuser produces perceptually realistic samples in all tasks, and its quantitative results (e.g., FID and CLIP score) are not only superior to existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) on representative tasks (e.g., text-to-image generation). A minimal sketch of the unified objective is given below.
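To make the key idea concrete, here is a minimal, hypothetical sketch of the unified objective (not the repository's actual training code; nnet and the noise schedule are illustrative placeholders): both modalities are perturbed with independent timesteps, and a single network regresses the noise of both at once.

import torch

# Illustrative DDPM-style noise schedule (a placeholder, not the paper's exact schedule)
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def unified_loss(nnet, x_img, x_txt):
    b = x_img.shape[0]
    # The key idea: sample an independent timestep for each modality
    t_img = torch.randint(0, T, (b,))
    t_txt = torch.randint(0, T, (b,))
    eps_img = torch.randn_like(x_img)
    eps_txt = torch.randn_like(x_txt)

    def perturb(x, t, eps):
        # Broadcast the schedule coefficient over the non-batch dimensions
        a = alphas_bar[t].view(-1, *([1] * (x.dim() - 1)))
        return a.sqrt() * x + (1.0 - a).sqrt() * eps

    z_img = perturb(x_img, t_img, eps_img)
    z_txt = perturb(x_txt, t_txt, eps_txt)
    # One network predicts the noise of both modalities jointly
    pred_img, pred_txt = nnet(z_img, z_txt, t_img, t_txt)
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()

Under this view, fixing one modality's timestep to 0 (clean data) recovers a conditional model, driving it to the maximum level marginalizes that modality out, and sharing the timestep gives the joint model, so one objective covers all the distributions.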



Dependencies

conda create -n unidiffuser python=3.9
conda activate unidiffuser
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116  # install torch-1.13.1
pip install accelerate==0.12.0 absl-py ml_collections einops ftfy==6.1.1 transformers==4.23.1
pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip

# xformers is optional, but it would greatly speed up the attention computation.
pip install -U xformers
pip install -U --pre triton
  • We highly suggest installing xformers, which greatly speeds up attention computation for both training and inference.
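As a quick, illustrative sanity check of the environment (not part of the repository), you can verify that CUDA and xformers are visible to Python:

# Hypothetical environment check; prints versions and falls back gracefully
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import xformers
    import xformers.ops  # memory-efficient attention kernels
    print("xformers", xformers.__version__, "is available")
except ImportError:
    print("xformers not installed; attention will use the plain implementation")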

Pretrained Models

UniDiffuser employs a transformer variant called U-ViT to parameterize the joint noise prediction network. The other components serve as encoders and decoders for the different modalities: a pretrained image autoencoder from Stable Diffusion, a pretrained ViT-B/32 CLIP image encoder, a pretrained ViT-L CLIP text encoder, and a GPT-2 text decoder finetuned by ourselves.

We provide two versions of UniDiffuser, each containing a U-ViT of 1B parameters; both can run on a GPU with at least 10 GB of memory. They can be downloaded from Hugging Face:

  • UniDiffuser-v0: This version is trained on LAION-5B at 512x512 resolution, which contains noisy web-crawled text-image pairs.
  • UniDiffuser-v1: This version is resumed from UniDiffuser-v0 and further trained on a set of less noisy internal text-image pairs. It takes a flag as input to distinguish web data from internal data during training.

Both links contain three files:

  • autoencoder_kl.pth is the weight of the image autoencoder converted from Stable Diffusion.
  • caption_decoder.pth is the weight of the finetuned GPT-2 text decoder.
  • uvit_v0.pth or uvit_v1.pth is the weight of U-ViT for UniDiffuser-v0 or UniDiffuser-v1.

Note that UniDiffuser-v0 and UniDiffuser-v1 share the same autoencoder_kl.pth and caption_decoder.pth, so you only need to download them once. The other components are downloaded automatically.

After downloading, create a new folder named models and put all pretrained models into this folder as follows:

├── models
│   ├── autoencoder_kl.pth
│   ├── caption_decoder.pth
│   └── uvit_v0.pth or uvit_v1.pth
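If you prefer scripting the download, here is a hedged sketch using huggingface_hub (assuming a recent version that supports local_dir; the repo id and filenames follow the list above):

from huggingface_hub import hf_hub_download

# Fetch the UniDiffuser-v1 checkpoints into the models/ folder
for name in ["autoencoder_kl.pth", "caption_decoder.pth", "uvit_v1.pth"]:
    hf_hub_download(repo_id="thu-ml/unidiffuser-v1", filename=name, local_dir="models")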

Inference

We suggest using UniDiffuser-v1 for better performance. Results are written to the out directory by default.

  • text-to-image generation
python sample_multi_v1.py --mode=t2i --prompt="an elephant under the sea"
  • image-to-text generation
python sample_multi_v1.py --mode=i2t --img=assets/space.jpg
  • joint generation
python sample_multi_v1.py --mode=joint
  • image generation
python sample_multi_v1.py --mode=i
  • text generation
python sample_multi_v1.py --mode=t
  • image variation
python sample_multi_v1.py --mode=i2t2i --img=assets/space.jpg
  • text variation
python sample_multi_v1.py --mode=t2i2t --prompt="an elephant under the sea"

We list all supported arguments below:

all supported arguments:
    --mode                          type of generation, one of t2i / i2t / joint / i / t / i2t2i / t2i2t
                                        t2i: text to image
                                        i2t: image to text
                                        joint: joint generation of text and image
                                        i: only generate image
                                        t: only generate text
                                        i2t2i: image variation, first image to text, then text to image
                                        t2i2t: text variation, first text to image, then image to text
    --prompt                        the prompt for text-to-image generation and text variation
    --img                           the image path for image-to-text generation and image variation
    --n_samples                     the number of samples to generate, default is 1
    --nrow                          number of images displayed in each row of the grid, default is 4
    --output_path                   dir to write results to, default is out
    --config.seed                   random seed, default is 1234
    --config.sample.sample_steps    number of dpm_solver sampling steps, default is 50
    --config.sample.scale           the classifier-free guidance scale for conditional generation, default is 7
    --config.sample.t2i_cfg_mode    used for text-to-image generation, one of true_uncond / empty_token, default is true_uncond
                                        true_uncond: use the unconditional model of UniDiffuser to perform classifier-free guidance
                                        empty_token: use the empty string to perform classifier-free guidance
    --config.data_type              one of 0 / 1, used for UniDiffuser-v1, default is 1
                                        0: corresponds to WebDataset during training
                                        1: corresponds to internal data during training

The inference commands for UniDiffuser-v0 are basically the same as for UniDiffuser-v1; just change sample_multi_v1.py to sample_multi_v0.py. For example:

python sample_multi_v0.py --mode=t2i --prompt="an elephant under the sea"
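The flags can also be combined. For example, the following (illustrative) invocation draws four samples with a fixed seed and a stronger guidance scale:

python sample_multi_v1.py --mode=t2i --prompt="an elephant under the sea" --n_samples=4 --config.seed=42 --config.sample.scale=9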

Integration with 🧨 diffusers

UniDiffuser is also available in 🧨 diffusers, in six different modes. Here is how one can use the UniDiffuserPipeline to generate images from text:

import torch
from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")
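
The same pipeline also covers the other directions. Below is a hedged sketch of image-to-text generation reusing pipe from the snippet above (the image URL is illustrative; the mode is inferred from the inputs, or can be set explicitly with pipe.set_image_to_text_mode()):

import requests
from PIL import Image

# Load any RGB image; the URL here is just a placeholder
url = "https://example.com/space.jpg"
init_image = Image.open(requests.get(url, stream=True).raw).convert("RGB").resize((512, 512))

# Image-to-text generation with the pipeline from above
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)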

For more details, check out the official UniDiffuser documentation.

References

If you find the code useful for your research, please consider citing:

@inproceedings{bao2022one,
  title={One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale},
  author={Bao, Fan and Nie, Shen and Xue, Kaiwen and Li, Chongxuan and Pu, Shi and Wang, Yaole and Yue, Gang and Cao, Yue and Su, Hang and Zhu, Jun},
  booktitle={ICML},
  year={2023}
}

@inproceedings{bao2022all,
  title={All are Worth Words: A ViT Backbone for Diffusion Models},
  author={Bao, Fan and Nie, Shen and Xue, Kaiwen and Cao, Yue and Li, Chongxuan and Su, Hang and Zhu, Jun},
  booktitle={CVPR},
  year={2023}
}

This implementation is heavily based on the U-ViT code.

More Repositories

  • tianshou: An elegant PyTorch deep reinforcement learning library. (Python, 7,810 stars)
  • zhusuan: A probabilistic programming library for Bayesian deep learning and generative models, based on TensorFlow. (Python, 2,202 stars)
  • prolificdreamer: ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation (NeurIPS 2023 Spotlight). (Python, 1,472 stars)
  • CRM: [ECCV 2024] Single Image to 3D Textured Mesh in 10 seconds with Convolutional Reconstruction Model. (Python, 520 stars)
  • ares: A Python library for adversarial machine learning, focusing on benchmarking adversarial robustness. (Python, 480 stars)
  • SageAttention: Quantized attention that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models. (Python, 222 stars)
  • controlvideo: Official implementation of "ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing". (Python, 181 stars)
  • warplda: Cache-efficient implementation of Latent Dirichlet Allocation. (C++, 161 stars)
  • 3D_Corruptions_AD: Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving (CVPR 2023). (Python, 119 stars)
  • low-bit-optimizers: Low-bit optimizers for PyTorch. (Python, 114 stars)
  • MMTrustEval: A toolbox for benchmarking the trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Datasets and Benchmarks Track). (Python, 89 stars)
  • stochastic_gcn: Stochastic training of graph convolutional networks. (Python, 84 stars)
  • RoboticsDiffusionTransformer: RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. (Python, 72 stars)
  • Attack-Bard (Python, 53 stars)
  • DPM-Solver-v3: Official code for "DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics" (NeurIPS 2023). (48 stars)
  • tianshou-docs-zh_CN: Chinese documentation for Tianshou. (TeX, 46 stars)
  • Prior-Guided-RGF (Python, 41 stars)
  • zh-clip (Python, 41 stars)
  • SRPO: Code accompanying the paper "Score Regularized Policy Optimization through Diffusion Behavior" (ICLR 2024). (Python, 36 stars)
  • vflow: Official code for "VFlow: More Expressive Generative Flows with Variational Data Augmentation" (ICML 2020). (Python, 35 stars)
  • AT3D: Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition (CVPR 2023, Highlight). (Python, 34 stars)
  • implicit-normalizing-flows: Code for "Implicit Normalizing Flows" (ICLR 2021 Spotlight). (Python, 34 stars)
  • HiDe-Prompt: Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality (NeurIPS 2023, Spotlight). (Python, 30 stars)
  • BigTopicModel: A fast engine for running large-scale topic models. (C++, 22 stars)
  • NUNO: [ICML 2023] Non-Uniform Neural Operator (NUNO). (Python, 18 stars)
  • IODF (C++, 16 stars)
  • fpovi: Code for "Function Space Particle Optimization for Bayesian Neural Networks". (Python, 16 stars)
  • CF-UIcA: Code for "Collaborative Filtering with User-Item Co-Autoregressive Models". (Python, 15 stars)
  • Zhusuan-Jittor: ZhuSuan with a Jittor backend. (Python, 14 stars)
  • LM-Calibration (Python, 12 stars)
  • mmdcgm-ssl: mmDCGMs for accurate classification and excellent class-conditional generation in semi-supervised learning. (Python, 11 stars)
  • Zhusuan-PaddlePaddle: ZhuSuan with a PaddlePaddle backend. (Python, 8 stars)
  • ood-dgm (Python, 8 stars)
  • MEM_DGM: Code for "Learning to Generate with Memory". (Python, 8 stars)
  • ProbML-book-solution (Jupyter Notebook, 7 stars)
  • adversarial_training_imagenet (Python, 7 stars)
  • pmd: Population matching discrepancy. (Python, 7 stars)
  • CEURL: Official implementation of "PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning" (NeurIPS 2024). (Python, 7 stars)
  • VCAS: Official code for "Efficient Backpropagation with Variance Controlled Adaptive Sampling" (ICLR 2024). (Python, 6 stars)
  • imagenet-a-plus (Python, 4 stars)
  • wmvl: Code for "A Wasserstein Minimum Velocity Approach to Learning Unnormalized Models". (Jupyter Notebook, 3 stars)
  • sEM-vr: Code for pLSA and LDA in the paper "Stochastic Expectation Maximization with Variance Reduction". (C++, 3 stars)
  • Efficient-Diffusion-Alignment: Official codebase for "Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control" (NeurIPS 2024). (3 stars)
  • Noise-Contrastive-Alignment: Code accompanying the paper "Noise Contrastive Alignment of Language Models with Explicit Rewards". (2 stars)
  • i-DODE: Official code for "Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs" (ICML 2023). (1 star)
  • CCA: Code accompanying the paper "Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment". (Python, 1 star)
  • ACTNN-PaddlePaddle (Python, 1 star)
  • DBIM: Official codebase for "Diffusion Bridge Implicit Models" (https://arxiv.org/abs/2405.15885). (1 star)
  • Jetfire-INT8Training (Jupyter Notebook, 1 star)