Diffusion Models already have a Semantic Latent Space (ICLR2023 notable-top-25%)
Diffusion Models already have a Semantic Latent Space
Mingi Kwon, Jaeseok Jeong, Youngjung Uh
arXiv preprint.
Abstract:
Diffusion models achieve outstanding generative performance in various domains. Despite their great success, they lack semantic latent space which is essential for controlling the generative process. To address the problem, we propose asymmetric reverse process (Asyrp) which discovers the semantic latent space in frozen pretrained diffusion models. Our semantic latent space, named h-space, has nice properties for accommodating semantic image manipulation: homogeneity, linearity, robustness, and consistency across timesteps. In addition, we introduce a principled design of the generative process for versatile editing and quality boosting by quantifiable measures: editing strength of an interval and quality deficiency at a timestep. Our method is applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES).
Description
This repo includes the official PyTorch implementation of Asyrp: Diffusion Models already have a Semantic Latent Space.
- Asyrp allows using h-space, the bottleneck of the U-Net, as a semantic latent space of diffusion models.
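As a rough illustration of what "asymmetric" means here, the sketch below follows the update described in the paper, assuming a deterministic DDIM step (eta = 0): the noise prediction obtained with the shifted bottleneck feature is used only for the predicted x0 term, while the direction term keeps the original prediction. The toy_eps function, its coefficients, and the scalar "image" are hypothetical stand-ins, not the repo's U-Net.

```python
import math

def toy_eps(x_t, delta_h=0.0):
    """Hypothetical stand-in for the frozen noise predictor."""
    h = 0.5 * x_t        # "encoder": bottleneck feature, i.e. a point in h-space
    h = h + delta_h      # the edit shifts only the bottleneck feature
    return 0.8 * h       # "decoder": predicted noise

def asyrp_step(x_t, alpha_t, alpha_prev, delta_h):
    """One asymmetric reverse step (deterministic DDIM, eta = 0 assumed).

    alpha_t and alpha_prev are the cumulative alpha products at t and t-1.
    """
    eps_shifted = toy_eps(x_t, delta_h)   # used for the predicted x0 term
    eps_original = toy_eps(x_t)           # used for the direction term
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_t) * eps_shifted) / math.sqrt(alpha_t)
    direction = math.sqrt(1.0 - alpha_prev) * eps_original
    return math.sqrt(alpha_prev) * pred_x0 + direction

def ddim_step(x_t, alpha_t, alpha_prev):
    """Plain DDIM step for comparison: the same update with no shift."""
    return asyrp_step(x_t, alpha_t, alpha_prev, delta_h=0.0)

if __name__ == "__main__":
    x_t, alpha_t, alpha_prev = 1.0, 0.5, 0.8
    print("unedited:", ddim_step(x_t, alpha_t, alpha_prev))
    print("edited:  ", asyrp_step(x_t, alpha_t, alpha_prev, delta_h=0.3))
```

With delta_h set to 0 the step reduces to ordinary DDIM, so the pretrained model stays frozen and only the shift in h-space changes the result.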
Edited real images (Top) as Happy dog (Bottom). So cute!!
Getting Started
We recommend running our code on an NVIDIA GPU with CUDA and cuDNN.
Pretrained Models for Asyrp
Asyrp works on the checkpoints of pretrained diffusion models.
Image Type to Edit | Size | Pretrained Model | Dataset | Reference Repo. |
---|---|---|---|---|
Human face | 256×256 | Diffusion (Auto) | CelebA-HQ | SDEdit |
Human face | 256×256 | Diffusion | CelebA-HQ | P2 weighting |
Human face | 256×256 | Diffusion | FFHQ | P2 weighting |
Church | 256×256 | Diffusion (Auto) | LSUN-Church | SDEdit |
Bedroom | 256×256 | Diffusion (Auto) | LSUN-Bedroom | SDEdit |
Dog face | 256×256 | Diffusion | AFHQ-Dog | ILVR |
Painting face | 256×256 | Diffusion | METFACES | P2 weighting |
ImageNet | 256×256 | Diffusion | ImageNet | Guided Diffusion |
- The pretrained diffusion models on 256×256 images in CelebA-HQ, LSUN-Church, and LSUN-Bedroom are automatically downloaded in the code (the download code comes from DiffusionCLIP).
- In contrast, you need to download the models pretrained on the other datasets in the table and put them in the ./pretrained directory.
- You can manually revise the checkpoint paths and names in the ./configs/paths_config.py file.
- We used the CelebA-HQ pretrained model from SDEdit, but we found that the one from P2 weighting is better. We highly recommend using the P2 weighting models rather than the SDEdit ones.
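Before launching a run, it can help to confirm that manually downloaded checkpoints are where the code will look. A minimal sketch, assuming the checkpoints sit directly under ./pretrained; the file names below are placeholders, so use whatever names you set in ./configs/paths_config.py.

```python
from pathlib import Path

def missing_checkpoints(pretrained_dir, expected_names):
    """Return the expected checkpoint file names that are not in pretrained_dir."""
    root = Path(pretrained_dir)
    return [name for name in expected_names if not (root / name).is_file()]

if __name__ == "__main__":
    # Placeholder names for illustration only; they are not the real file names.
    expected = ["afhq_dog.pt", "ffhq_p2.pt"]
    for name in missing_checkpoints("./pretrained", expected):
        print(f"missing: ./pretrained/{name}")
```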
Datasets
To precompute latents and find the direction in h-space, you need about 100 or more images from the dataset. You can use either images sampled from the pretrained models or real images from the pretraining dataset.
If you want to use real images, check the URLs :
You can simply modify ./configs/paths_config.py to set the dataset path.
CUSTOM Datasets
If you want to use a custom dataset, you can use the config/custom.yml file.
- You have to match data.dataset in custom.yml with your data domain. For example, if you want to use human face images, data.dataset should be CelebA_HQ or FFHQ.
- data.category should be 'CUSTOM'.
- Then, you can use the arguments below:
--custom_train_dataset_dir "your/custom/dataset/dir/train" \
--custom_test_dataset_dir "your/custom/dataset/dir/test" \
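Since roughly 100 or more images are needed, a quick count of the custom folders can catch problems early. This is a minimal sketch: the extension list and the recursive folder search are assumptions, and the repo's own loader may accept a different set of files.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed; adjust to your data

def count_images(dataset_dir):
    """Count image files under dataset_dir, searching subfolders too."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return 0
    return sum(
        1 for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

if __name__ == "__main__":
    for split in ("train", "test"):
        folder = f"your/custom/dataset/dir/{split}"
        print(f"{folder}: {count_images(folder)} images")
```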
Get LPIPS distance
We provide precomputed LPIPS distances for CelebA_HQ, LSUN-Bedroom, LSUN-Church, AFHQ-Dog, and METFACES in ./utils.
If you want to use a custom or other dataset, we recommend precomputing the LPIPS distances.
To precompute the LPIPS distances used to automatically define t_edit and t_boost, run the following command using script_get_lpips.sh.
python main.py --lpips \
--config $config \
--exp ./runs/tmp \
--edit_attr test \
--n_train_img 100 \
--n_inv_step 1000
- $config: celeba.yml for human face, bedroom.yml for bedroom, church.yml for church, afhq.yml for dog face, imagenet.yml for images from ImageNet, metface.yml for artistic faces from METFACES, ffqh.yml for human face from FFHQ.
- exp: Experiment name.
- edit_attr: Attribute to edit. It is not used for now. You can use ./utils/text_dic.py to look up predefined source-target text pairs or to define a new pair.
- n_train_img: Number of images used to compute the LPIPS distances.
- n_inv_step: Number of steps of the generative process used for the inversion. You can use --n_inv_step 50 for speed.
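If you run the LPIPS step for several datasets, the command above can be assembled programmatically. The sketch below only reuses the flags and config names listed in this section; the CONFIGS keys and the lpips_command helper are hypothetical names introduced for illustration.

```python
import shlex

# Config file per image type, taken from the list above.
CONFIGS = {
    "human face": "celeba.yml",
    "bedroom": "bedroom.yml",
    "church": "church.yml",
    "dog face": "afhq.yml",
    "imagenet": "imagenet.yml",
    "artistic face": "metface.yml",
}

def lpips_command(image_type, n_train_img=100, n_inv_step=1000, exp="./runs/tmp"):
    """Build the argument list for the LPIPS precomputation run."""
    return [
        "python", "main.py", "--lpips",
        "--config", CONFIGS[image_type],
        "--exp", exp,
        "--edit_attr", "test",
        "--n_train_img", str(n_train_img),
        "--n_inv_step", str(n_inv_step),
    ]

if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(shlex.join(lpips_command("dog face", n_inv_step=50)))
```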
Asyrp
To train the implicit function f, there are two optional preparation steps: 1) get the LPIPS distances and 2) precompute real images.
We already provide precomputed LPIPS distances for CelebA_HQ, LSUN-Bedroom, LSUN-Church, AFHQ-Dog, and METFACES in ./utils.
If you want to use your own t_edit (e.g., 500) and t_boost (e.g., 200), you don't need to get the LPIPS distances.
In that case, you can use the arguments below:
--user_defined_t_edit 500 \
--user_defined_t_addnoise 200 \
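When these arguments are omitted, t_edit and t_boost are defined automatically from the LPIPS distances, and the training command takes thresholds through --lpips_edit_th and --lpips_addnoise_th. The sketch below shows one plausible selection rule, picking the timestep whose distance is closest to the threshold. The dictionary format, the sample values, and the closest-match rule are assumptions for illustration and may differ from what the repo does.

```python
def closest_timestep(lpips_by_t, threshold):
    """Return the timestep whose LPIPS distance is nearest to threshold."""
    return min(lpips_by_t, key=lambda t: abs(lpips_by_t[t] - threshold))

if __name__ == "__main__":
    # Made-up distances that grow with the timestep; real values would come
    # from the precomputed LPIPS files.
    lpips_by_t = {0: 0.00, 200: 0.15, 400: 0.30, 600: 0.42, 800: 0.60}
    print("t_edit:", closest_timestep(lpips_by_t, 0.33))
```

t_boost would presumably be picked in the same way from its own distances and threshold.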
If you want to train with sampled images, you don't need to precompute real images. In that case, you can use the argument below:
--load_random_noise
Precompute real images
To precompute real images and save time, run the following command using script_precompute.sh.
python main.py --run_train \
--config $config \
--exp ./runs/tmp \
--edit_attr test \
--do_train 1 \
--do_test 1 \
--n_train_img 100 \
--n_test_img 32 \
--bs_train 1 \
--n_inv_step 50 \
--n_train_step 50 \
--n_test_step 50 \
--just_precompute
Train the implicit function
To train the implicit function, run the following command using script_train.sh.
python main.py --run_train \
--config $config \
--exp ./runs/example \
--edit_attr $guid \
--do_train 1 \
--do_test 1 \
--n_train_img 100 \
--n_test_img 32 \
--n_iter 5 \
--bs_train 1 \
--t_0 999 \
--n_inv_step 50 \
--n_train_step 50 \
--n_test_step 50 \
--get_h_num 1 \
--train_delta_block \
--save_x0 \
--use_x0_tensor \
--lr_training 0.5 \
--clip_loss_w 1.0 \
--l1_loss_w 3.0 \
--add_noise_from_xt \
--lpips_addnoise_th 1.2 \
--lpips_edit_th 0.33 \
--sh_file_name $sh_file_name \
(optional - if you skip "get LPIPS")
--user_defined_t_edit 500 \
--user_defined_t_addnoise 200 \
(optional - if you skip "precompute")
--load_random_noise
- do_test: If you want to finish training quickly without checking the outputs in the middle of training, you can set do_test to 0.
- bs_train: Batch size.
- n_iter: Number of training iterations.
- get_h_num: It determines the number of attributes. It should be 1 for training.
- train_delta_block: Train the implicit function. You can use --train_delta_h instead of --train_delta_block to optimize the direction directly (we recommend --train_delta_block).
- --save_x0, --use_x0_tensor: Use them if you want to save the results together with the original real images.
- n_inv_step, n_train_step, n_test_step: Number of steps of the generative process for the inversion, training, and test, respectively. The timesteps lie in [0, 999]. We usually use 40 or 1000 for n_inv_step, 40 or 50 for n_train_step, and 40, 50, or 1000 for n_test_step.
- clip_loss_w, l1_loss_w: Weights of the CLIP loss and the L1 loss.
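To make the step counts concrete: with --t_0 999 and --n_inv_step 50, the process visits 50 timesteps spread over [0, 999] rather than all 1000. A minimal sketch, assuming evenly spaced timesteps truncated to integers; the repo may build its schedule differently.

```python
def timestep_sequence(n_steps, t_0=999):
    """Evenly spaced integer timesteps from 0 to t_0, inclusive."""
    if n_steps < 2:
        # A single step cannot span a range; return the start only.
        return [0]
    return [int(i * t_0 / (n_steps - 1)) for i in range(n_steps)]

if __name__ == "__main__":
    seq = timestep_sequence(50)
    print(len(seq), seq[:4], seq[-1])
```

Fewer steps mean a coarser walk over the same range, which is why 40 or 50 steps are faster than 1000.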
Inference
After training has finished, you can run inference with various settings using script_inference.sh. We provide some of them.
python main.py --run_test \
--config $config \
--exp ./runs/example \
--edit_attr $guid \
--do_train 1 \
--do_test 1 \
--n_train_img 100 \
--n_test_img 32 \
--n_iter 5 \
--bs_train 1 \
--t_0 999 \
--n_inv_step 50 \
--n_train_step 50 \
--n_test_step $test_step \
--get_h_num 1 \
--train_delta_block \
--add_noise_from_xt \
--lpips_addnoise_th 1.2 \
--lpips_edit_th 0.33 \
--sh_file_name $sh_file_name \
--save_x0 \
--use_x0_tensor \
--hs_coeff_delta_h 1.0 \
(optional - checkpoint)
--load_from_checkpoint "exp_name"
or
--manual_checkpoint_name "full_path.pth"
(optional - gradually editing)
--delta_interpolation
--max_delta 1.0
--min_delta -1.0
--num_delta 10
(optional - multi)
--multiple_attr "exp1 exp2 exp3"
--multiple_hs_coeff "1 0.5 1.5"
- exp: It should match the exp used for training. If you want to use our pretrained implicit function, you have to set --exp to $guid.
- do_train, do_test: Sample from the training dataset / test dataset.
- n_iter: If n_iter is the same as the value used for training, the checkpoint from the last iteration is used.
- n_test_step: You can manually adjust the number of inference steps. 1000 shows the best quality.
- hs_coeff_delta_h: You can manually adjust the degree of editing. It can be negative.
- --load_from_checkpoint, --manual_checkpoint_name: load_from_checkpoint should be the name of the exp. manual_checkpoint_name should be the full path of the checkpoint.
- --delta_interpolation: You can set the $max, $min, and $num values. The $num results use a gradually increasing degree of editing from min to max.
- --multiple_attr: If you use multiple attributes, write down the names of the exps (use blanks as separators). You can use --multiple_hs_coeff to adjust the degree of editing for each of them.
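The two optional modes can be pictured as follows. This is a sketch under assumptions: evenly spaced coefficients that include both endpoints for --delta_interpolation, and a position-by-position pairing of experiment names with coefficients for --multiple_attr. The helper names are hypothetical and the repo's internals may differ.

```python
def interpolation_coeffs(min_delta, max_delta, num_delta):
    """Evenly spaced editing strengths from min_delta to max_delta."""
    if num_delta < 2:
        return [float(max_delta)]
    step = (max_delta - min_delta) / (num_delta - 1)
    return [min_delta + i * step for i in range(num_delta)]

def pair_attributes(multiple_attr, multiple_hs_coeff):
    """Match blank-separated experiment names with their coefficients."""
    names = multiple_attr.split()
    coeffs = [float(c) for c in multiple_hs_coeff.split()]
    if len(names) != len(coeffs):
        raise ValueError("need exactly one coefficient per attribute")
    return dict(zip(names, coeffs))

if __name__ == "__main__":
    print(interpolation_coeffs(-1.0, 1.0, 5))
    print(pair_attributes("exp1 exp2 exp3", "1 0.5 1.5"))
```

Negative strengths push the edit in the opposite direction, which matches the note that hs_coeff_delta_h can be negative.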
Acknowledgement
The code is based on DiffusionCLIP.