Würstchen
What is this?
Würstchen is a new framework for training text-conditional models that moves the computationally expensive text-conditional stage into a highly compressed latent space. Common approaches use a single compression stage, whereas Würstchen adds a second stage that compresses the image even further. In total, Stages A and B are responsible for compressing images, and Stage C learns the text-conditional part in the resulting low-dimensional latent space. With this setup, Würstchen achieves a 42x compression factor while still reconstructing images faithfully, which makes training Stage C fast and computationally cheap. We refer to the paper for details.
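To get a feel for what a 42x compression factor means in practice, here is a rough back-of-the-envelope sketch. It assumes the factor applies to each spatial side of a square image and that the result is simply rounded; the helper names `latent_side` and `position_ratio` are hypothetical and not part of the Würstchen code base, and the actual resizing logic in the model may differ.

```python
# Illustrative arithmetic only; the rounding rule is an assumption,
# not the model's exact resizing logic.
COMPRESSION_FACTOR = 42  # spatial compression factor quoted in the text


def latent_side(pixels: int, factor: int = COMPRESSION_FACTOR) -> int:
    """Approximate side length of the Stage C latent for a given image side."""
    return round(pixels / factor)


def position_ratio(pixels: int, factor: int = COMPRESSION_FACTOR) -> float:
    """Number of pixel positions that map onto one latent position (square image)."""
    side = latent_side(pixels, factor)
    return (pixels * pixels) / (side * side)


if __name__ == "__main__":
    for pixels in (512, 1024):
        side = latent_side(pixels)
        print(
            f"{pixels}x{pixels} image -> about {side}x{side} latent, "
            f"~{position_ratio(pixels):.0f} pixel positions per latent position"
        )
```

Under these assumptions a 512x512 image maps to roughly a 12x12 latent, so Stage C only has to model a few hundred spatial positions instead of hundreds of thousands of pixels.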
Use Würstchen
You can use the model through the notebooks here. The Stage B notebook is only for reconstruction, and the Stage C notebook is for text-conditional generation. You can also try text-to-image generation on Google Colab.
Using in 🧨 diffusers
Würstchen is fully integrated into the diffusers library. Here's how to use it:
# pip install -U transformers accelerate diffusers
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

# Load the combined pipeline (Stage C prior plus Stage A/B decoder) in half precision on the GPU
pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")

caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,  # timestep schedule for the text-conditional prior (Stage C)
    prior_guidance_scale=4.0,
    num_images_per_prompt=2,
).images  # list of generated images
Refer to the official documentation to learn more.
Train your own Würstchen
Training Würstchen is considerably faster and cheaper than training other text-to-image models, as it trains in a much smaller latent space of 12x12. We provide training scripts for both Stage B and Stage C.
Download Models
Model | Download | Parameters | Conditioning | Training Steps | Resolution |
---|---|---|---|---|---|
Würstchen v1 | Hugging Face | 1B (Stage C) + 600M (Stage B) + 19M (Stage A) | CLIP-H-Text | 800,000 | 512x512 |
Würstchen v2 | Hugging Face | 1B (Stage C) + 600M (Stage B) + 19M (Stage A) | CLIP-bigG-Text | 918,000 | 1024x1024 |
Acknowledgment
Special thanks to Stability AI for providing compute for our research.