Yingqing He*, Shaoshu Yang*, Haoxin Chen, Xiaodong Cun, Menghan Xia,
Yong Zhang#, Xintao Wang, Ran He, Qifeng Chen#, and Ying Shan
(* first author, # corresponding author)
Input: "A beautiful girl on a boat"; Resolution: 2048 x 1152.
Input: "Miniature house with plants in the potted area, hyper realism, dramatic ambient lighting, high detail"; Resolution: 4096 x 4096.
Arbitrary higher-resolution generation based on SD 2.1.
ScaleCrafter can generate images at a resolution of 4096 x 4096 and videos at a resolution of 2048 x 1152 using diffusion models pre-trained at lower resolutions. Notably, our approach requires no extra training or optimization.
- Everyone is welcome to collaborate on the code repository, improve the methods, and build more downstream tasks. Please check CONTRIBUTING.md.
- If you have any questions or comments, we are open for discussion.
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with Stable Diffusion pre-trained on images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, do not address these issues well. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach addresses the repetition issue well and achieves state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a diffusion model pre-trained on low-resolution images can be used directly for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
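To make the "perception field" argument concrete, the following is a minimal pure-Python sketch of ordinary integer dilation on a 1-D signal. It is an illustration only, not the re-dilation code in this repository: the function names `effective_kernel_size` and `dilated_conv1d` are hypothetical, and the actual method operates on the 2-D convolutions of the U-Net. The point it shows is that the same three weights cover a wider span once their taps are spaced further apart.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input samples) covered by a kernel whose taps are `dilation` apart."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with a dilated kernel (no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Tap i of the kernel reads the input at offset i * dilation.
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out


signal = [0, 1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # identical weights in both calls; only the tap spacing changes

print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
print(dilated_conv1d(signal, kernel, dilation=1))
print(dilated_conv1d(signal, kernel, dilation=2))
```

With dilation 2, a 3-tap kernel spans 5 input samples without any new weights, which is why a pre-trained kernel can be reused at a larger scale without retraining.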
- [2023.10.12]: Release paper.
- [2023.10.12]: Release source code for both the diffusers version and the lightning version.
- [2023.10.16]: Integrate FreeU as the default mode to further improve our higher-res generation quality. (To disable it, add --disable_freeu.)
- Hugging Face Gradio demo
- ScaleCrafter with more controls (e.g., ControlNet/T2I Adapter)
conda create -n scalecrafter python=3.8
conda activate scalecrafter
pip install -r requirements.txt
# 2048x2048 (4x) generation
python3 text2image_xl.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
--validation_prompt "a professional photograph of an astronaut riding a horse" \
--seed 23 \
--config ./configs/sdxl_2048x2048.yaml \
--logging_dir ${your-logging-dir}
To generate in other resolutions, change the value of the parameter --config to:
- 2048x2048: ./configs/sdxl_2048x2048.yaml
- 2560x2560: ./configs/sdxl_2560x2560.yaml
- 4096x2048: ./configs/sdxl_4096x2048.yaml
- 4096x4096: ./configs/sdxl_4096x4096.yaml
Generated images will be saved to the directory set by ${your-logging-dir}. You can use your own prompts by setting --validation_prompt to a prompt string or to the path of a custom .txt file. If you use a .txt prompt file, put each prompt on its own line.
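The two accepted forms of --validation_prompt can be sketched as below. This is a hypothetical helper written for illustration (the name `load_prompts` and its blank-line handling are assumptions, not the repository's actual parsing code); it only mirrors the behavior described above: a literal string is one prompt, and a .txt file holds one prompt per line.

```python
import os
import tempfile


def load_prompts(validation_prompt: str) -> list:
    """Return prompts from either a literal prompt string or a .txt file path."""
    if validation_prompt.endswith(".txt") and os.path.isfile(validation_prompt):
        with open(validation_prompt, encoding="utf-8") as f:
            # One prompt per line; blank lines are skipped (an assumption).
            return [line.strip() for line in f if line.strip()]
    # Anything else is treated as a single prompt string.
    return [validation_prompt]


# Usage: write a small prompt file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompts.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("a professional photograph of an astronaut riding a horse\n")
        f.write("\n")
        f.write("a beautiful girl on a boat\n")
    file_prompts = load_prompts(path)

single_prompt = load_prompts("a beautiful girl on a boat")
print(file_prompts)
print(single_prompt)
```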
--pretrained_model_name_or_path specifies the pre-trained model to use. You can provide a Hugging Face repo name (the model is downloaded from Hugging Face first) or a local directory containing the model checkpoint.
You can define a custom generation resolution by creating a .yaml configuration file and specifying which layers our method applies to, along with their dilation scales. Please see ./assets/dilate_setttings/sdxl_2048x2048_dilate.txt as an example.
# sd v1.5 1024x1024 (4x) generation
python3 text2image.py \
--pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
--validation_prompt "a professional photograph of an astronaut riding a horse" \
--seed 23 \
--config ./configs/sd1.5_1024x1024.yaml \
--logging_dir ${your-logging-dir}
# sd v2.1 1024x1024 (4x) generation
python3 text2image.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base \
--validation_prompt "a professional photograph of an astronaut riding a horse" \
--seed 23 \
--config ./configs/sd2.1_1024x1024.yaml \
--logging_dir ${your-logging-dir}
To generate in other resolutions, please use the following config files:
- 1024x1024: ./configs/sd1.5_1024x1024.yaml or ./configs/sd2.1_1024x1024.yaml
- 1280x1280: ./configs/sd1.5_1280x1280.yaml or ./configs/sd2.1_1280x1280.yaml
- 2048x1024: ./configs/sd1.5_2048x1024.yaml or ./configs/sd2.1_2048x1024.yaml
- 2048x2048: ./configs/sd1.5_2048x2048.yaml or ./configs/sd2.1_2048x2048.yaml
Please see the instructions above to use your customized text prompt.
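When scripting several runs, the config list above can be turned into a small lookup. The sketch below is a hypothetical convenience wrapper (the names `CONFIGS`, `MODELS`, and `build_command` are not part of the repository); it only assembles the command lines already shown in this README and rejects resolutions that have no listed config file.

```python
import shlex

FAMILIES = ("sd1.5", "sd2.1")
RESOLUTIONS = ("1024x1024", "1280x1280", "2048x1024", "2048x2048")

# Config files listed in this README, keyed by (model family, resolution).
CONFIGS = {
    (family, res): f"./configs/{family}_{res}.yaml"
    for family in FAMILIES
    for res in RESOLUTIONS
}

# Model names taken from the example commands in this README.
MODELS = {
    "sd1.5": "runwayml/stable-diffusion-v1-5",
    "sd2.1": "stabilityai/stable-diffusion-2-1-base",
}


def build_command(family, resolution, prompt, logging_dir, seed=23):
    """Return a shell-quoted text2image.py command for a listed config."""
    key = (family, resolution)
    if key not in CONFIGS:
        raise ValueError(f"no config listed for {family} at {resolution}")
    args = [
        "python3", "text2image.py",
        "--pretrained_model_name_or_path", MODELS[family],
        "--validation_prompt", prompt,
        "--seed", str(seed),
        "--config", CONFIGS[key],
        "--logging_dir", logging_dir,
    ]
    return shlex.join(args)  # requires Python 3.8+


cmd = build_command("sd2.1", "2048x1024", "a cat", "./out")
print(cmd)
```

Quoting the prompt through `shlex.join` keeps prompts with spaces or punctuation intact when the command is pasted into a shell.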
We implement MATLAB functions to achieve convolution dispersion. To use them, change your MATLAB working directory to /disperse. Solve for the convolution dispersion transform with:
% Small kernel 3, large kernel 5, input feature size 3, perception field enlargement scale 2
% Loss weighting 0.05, verbose (show visualization) true
R = kernel_disperse(3, 5, 3, 2, 0.05, true)
You can then save the transform by right-clicking R in the workspace window and saving it in .mat format. We recommend setting the input feature size equal to the small kernel size, since this speeds up the computation.
Empirically, this performs well for all convolution kernels in the UNet.
One can also compute a specific dispersion transform for every input feature size in the diffusion model UNet.
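How a linear transform can turn a small kernel into a larger one is sketched below in pure Python. Everything here is an assumption made for illustration: the layout of R (one row per large-kernel entry, one column per small-kernel entry), the function names `apply_dispersion` and `zero_pad_transform`, and the toy transform itself, which simply zero-pads the kernel. The R returned by kernel_disperse is the solution of an optimization problem and will generally not look like this.

```python
def apply_dispersion(R, small_kernel):
    """Map a small square kernel to a large one via large_flat = R @ small_flat.

    Assumes R has one row per large-kernel entry and one column per
    small-kernel entry (an assumption about the layout, not a documented fact).
    """
    flat = [w for row in small_kernel for w in row]
    large_flat = [sum(r * w for r, w in zip(row, flat)) for row in R]
    side = int(round(len(large_flat) ** 0.5))
    return [large_flat[i * side:(i + 1) * side] for i in range(side)]


def zero_pad_transform(small=3, large=5):
    """Toy stand-in for R: places the small kernel at the center of the large one."""
    offset = (large - small) // 2
    R = [[0.0] * (small * small) for _ in range(large * large)]
    for i in range(small):
        for j in range(small):
            R[(i + offset) * large + (j + offset)][i * small + j] = 1.0
    return R


small_kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
large_kernel = apply_dispersion(zero_pad_transform(3, 5), small_kernel)
for row in large_kernel:
    print(row)
```

Because the mapping is linear, a transform solved once for a given pair of kernel sizes could be reused for every kernel of that size.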
LongerCrafter: Tuning-free method for longer high-quality video generation.
VideoCrafter: Framework for high-quality video generation.
TaleCrafter: An interactive story visualization tool that supports multiple characters.
@inproceedings{he2023scalecrafter,
title={Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models},
author={He, Yingqing and Yang, Shaoshu and Chen, Haoxin and Cun, Xiaodong and Xia, Menghan and Zhang, Yong and Wang, Xintao and He, Ran and Chen, Qifeng and Shan, Ying},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}
If you have any comments or questions, feel free to contact Yingqing He or Shaoshu Yang.