Karlo-v1.0.alpha on COYO-100M and CC15M

Karlo is a text-conditional image generation model based on OpenAI's unCLIP architecture. It improves on the standard super-resolution model that upscales 64px to 256px, recovering high-frequency details in only a small number of denoising steps.

"a portrait of an old monk, highly detailed."

"Photo of a business woman, silver hair"

"A teddy bear on a skateboard, children drawing style."

"Goryeo celadon in the shape of bird"

This alpha version of Karlo is trained on 115M image-text pairs, including the COYO-100M high-quality subset, CC3M, and CC12M. If you are interested in a better version of Karlo trained on larger-scale, high-quality datasets, please visit the landing page of our application B^DISCOVER.

Model Architecture

Overview

Karlo is a text-conditional diffusion model based on unCLIP, composed of prior, decoder, and super-resolution modules. In this repository, we include an improved version of the standard super-resolution module that upscales 64px to 256px in only 7 reverse steps, as illustrated in the figure below:

Specifically, the standard SR module, trained with the DDPM objective, upscales 64px to 256px in the first 6 denoising steps using the respacing technique. Then, an additional fine-tuned SR module, trained with a VQ-GAN-style loss, performs the final reverse step to recover high-frequency details. We observe that this approach is very effective for upscaling low-resolution images in a small number of reverse steps.
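The split between the six respaced DDPM steps and the single fine-tuned step can be pictured with a small sketch. It assumes a 1000-step training schedule and evenly spaced sub-sampling, neither of which is stated in this README. The function name is hypothetical, and this is an illustration rather than the repository's sampler.

```python
def respaced_timesteps(num_train_steps: int, num_sample_steps: int) -> list:
    """Pick `num_sample_steps` evenly spaced timesteps, highest noise first."""
    if not 1 <= num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must be in [1, num_train_steps]")
    stride = num_train_steps / num_sample_steps
    # Evenly spaced indices over the original schedule, then reversed so that
    # sampling starts from the noisiest timestep.
    steps = [round(i * stride) for i in range(num_sample_steps)]
    return steps[::-1]


# Assumption: 1000 training steps; 7 sampling steps in total for the SR stage.
schedule = respaced_timesteps(1000, 7)
ddpm_steps = schedule[:-1]  # 6 steps for the standard (DDPM-trained) SR module
final_step = schedule[-1]   # 1 step for the fine-tuned SR module
```

The point of the sketch is only the 6 + 1 partition of the reverse process; the actual timestep values depend on the noise schedule used in training.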

Details

We train all components from scratch on 115M image-text pairs, including COYO-100M, CC3M, and CC12M. For the Prior and Decoder, we use the ViT-L/14 provided by OpenAI’s CLIP repository. Unlike the original implementation of unCLIP, we replace the trainable transformer in the decoder with the text encoder of ViT-L/14 for efficiency. For the SR module, we first train the model with the DDPM objective for 1M steps, followed by an additional 234K steps to fine-tune the additional component. The table below summarizes the important statistics of our components:

                      Prior                             Decoder                   SR
CLIP                  ViT-L/14                          ViT-L/14                  -
#param                1B                                900M                      700M + 700M
#optimization steps   1M                                1M                        1M + 0.2M
#sampling steps       25                                50 (default), 25 (fast)   7
Checkpoint links      ViT-L-14, ViT-L-14 stats, model   model                     model

In the checkpoint links, ViT-L-14 is identical to the original OpenAI version; we include it for convenience. Note that ViT-L-14 stats is required to normalize the outputs of the prior module.
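One way to read the role of the statistics file: the prior works with a normalized CLIP image embedding, and the stored statistics map it back to the scale the decoder expects. The sketch below assumes the file holds a per-dimension mean and standard deviation and that the mapping is `x * std + mean`. Both are assumptions not confirmed by this README, and plain Python lists stand in for tensors.

```python
def denormalize(embedding, mean, std):
    """Map a normalized embedding back to the original feature scale."""
    if not (len(embedding) == len(mean) == len(std)):
        raise ValueError("embedding, mean and std must have the same length")
    return [x * s + m for x, m, s in zip(embedding, mean, std)]


# Toy 3-dimensional example; real ViT-L/14 embeddings are much larger.
prior_output = [0.0, 1.0, -2.0]
clip_mean = [0.5, 0.5, 0.5]   # hypothetical statistics
clip_std = [2.0, 2.0, 2.0]    # hypothetical statistics
image_embedding = denormalize(prior_output, clip_mean, clip_std)
```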

Evaluation

We quantitatively measure the performance of Karlo-v1.0.alpha on the validation splits of CC3M and MS-COCO. The tables below present CLIP-score and FID. To measure FID, we resize each image so that its shorter side is 256px and then crop it at the center. We set the classifier-free guidance scales for the prior and decoder to 4 and 8, respectively, in all cases. We observe that our model achieves reasonable performance even with 25 decoder sampling steps.
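The FID preprocessing described above (shorter side to 256px, then a center crop) could be sketched as follows. The helper names are hypothetical, Pillow is assumed as the image library, and the bicubic filter is an assumption, since the README does not name a resampling method.

```python
def resize_and_crop_geometry(width: int, height: int, target: int = 256):
    """Return (resized_size, crop_box) for shorter-side resize + center crop."""
    scale = target / min(width, height)
    # max() guards against rounding the shorter side below the target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left, top = (new_w - target) // 2, (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def preprocess_for_fid(image, target: int = 256):
    """Apply the geometry to a Pillow image (requires `pip install pillow`)."""
    from PIL import Image  # lazy import: the geometry helper has no dependencies

    size, box = resize_and_crop_geometry(*image.size, target)
    # Bicubic resampling is an assumption, not taken from the README.
    return image.resize(size, Image.BICUBIC).crop(box)


# A 640x480 image is resized to 341x256 and cropped to the central 256x256.
size, box = resize_and_crop_geometry(640, 480)
```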

CC3M

Sampling step                        CLIP-s (ViT-B/16)   FID (13k from val)
Prior (25) + Decoder (25) + SR (7)   0.3081              14.37
Prior (25) + Decoder (50) + SR (7)   0.3086              13.95

MS-COCO

Sampling step                        CLIP-s (ViT-B/16)   FID (30k from val)
Prior (25) + Decoder (25) + SR (7)   0.3192              15.24
Prior (25) + Decoder (50) + SR (7)   0.3192              14.43

For more information, please refer to the upcoming technical report.

🧨 Diffusers integration

Our unCLIP implementation is officially integrated into the 🧨 diffusers library.

# Requirements to run Karlo unCLIP with diffusers
pip install diffusers transformers accelerate safetensors

# Generate an image from a text prompt (Python)
from diffusers import UnCLIPPipeline
import torch

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to('cuda')

prompt = "a high-resolution photograph of a big red frog on a green leaf."
image = pipe(prompt).images[0]
image.save("./frog.png")

Check out the diffusers docs for the full usage of the UnCLIPPipeline.
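To reproduce the sampling settings reported in this README through diffusers, the step counts and guidance scales can be passed as keyword arguments. The sketch below is hedged: the parameter names reflect my understanding of `UnCLIPPipeline.__call__` and should be checked against the diffusers docs for your installed version, and the helper `karlo_sampling_kwargs` is hypothetical.

```python
def karlo_sampling_kwargs(fast: bool = False) -> dict:
    """Keyword arguments mirroring the step counts and guidance scales in this README."""
    return {
        "prior_num_inference_steps": 25,
        "decoder_num_inference_steps": 25 if fast else 50,  # 25 (fast), 50 (default)
        "super_res_num_inference_steps": 7,
        "prior_guidance_scale": 4.0,
        "decoder_guidance_scale": 8.0,
    }


kwargs = karlo_sampling_kwargs(fast=True)
# With `pipe` and `prompt` defined as in the diffusers snippet of this section:
# image = pipe(prompt, **kwargs).images[0]
```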

Environment Setup

We sample on a single V100 GPU with 32GB of VRAM, using PyTorch >= 1.10 and CUDA >= 11. The following commands install the additional Python packages and download the pretrained model checkpoints. Alternatively, setup.sh installs the packages and downloads the weights in one step.

  • Additional Python packages
pip install -r requirements.txt
  • Model checkpoints
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/096db1af569b284eb76b3881534822d9/ViT-L-14.pt -P $KARLO_ROOT_DIR  # same as the official ViT-L/14 from OpenAI
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/0b62380a75e56f073e2844ab5199153d/ViT-L-14_stats.th -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/efdf6206d8ed593961593dc029a8affa/decoder-ckpt-step%3D01000000-of-01000000.ckpt -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/85626483eaca9f581e2a78d31ff905ca/prior-ckpt-step%3D01000000-of-01000000.ckpt -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/4226b831ae0279020d134281f3c31590/improved-sr-ckpt-step%3D1.2M.ckpt -P $KARLO_ROOT_DIR
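
Before launching a demo, it can help to confirm that all five files landed in the root directory. The following is a minimal sketch with a hypothetical helper, `missing_checkpoints`. Because the saved file name may contain either `%3D` or `=` depending on how the download tool decodes the URL, the check accepts both, which is an assumption on my part.

```python
import os
from urllib.parse import unquote

# File names taken from the last path segment of the download URLs above.
CHECKPOINT_FILES = [
    "ViT-L-14.pt",
    "ViT-L-14_stats.th",
    "decoder-ckpt-step%3D01000000-of-01000000.ckpt",
    "prior-ckpt-step%3D01000000-of-01000000.ckpt",
    "improved-sr-ckpt-step%3D1.2M.ckpt",
]


def missing_checkpoints(root_dir: str) -> list:
    """Return the expected checkpoint names that are not found under `root_dir`."""
    missing = []
    for raw_name in CHECKPOINT_FILES:
        # Accept either the percent-encoded or the decoded ("=") file name.
        candidates = {raw_name, unquote(raw_name)}
        if not any(os.path.isfile(os.path.join(root_dir, c)) for c in candidates):
            missing.append(unquote(raw_name))
    return missing
```

For example, `missing_checkpoints(os.environ["KARLO_ROOT_DIR"])` returns an empty list once every file is in place.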

Sampling

Gradio demo (T2I and Image variation)

The following command launches a Gradio demo for text-to-image generation and image variation. With PyTorch >= 1.12, we have observed that the second run in the demo is unexpectedly slow. We suspect this happens because launching the CUDA kernels takes some time, usually up to 2 minutes.

python demo/product_demo.py --host 0.0.0.0 --port $PORT --root-dir $KARLO_ROOT_DIR

The samples below are non-cherry-picked T2I and image variation examples generated with random seed 0. In each case, the first row shows T2I samples and the second row shows image variations of the leftmost image in the first row.

[T2I + Image variation] "A man with a face of avocado, in the drawing style of Rene Magritte."

[T2I + Image variation] "a black porcelain in the shape of pikachu"

T2I command line example

Here, we include a command-line example of T2I. For image variation, refer to karlo/sampler/i2i.py to see how the prior is replaced with the CLIP image feature.

python example.py --root-dir=$KARLO_ROOT_DIR \
                  --prompt="A man with a face of avocado, in the drawing style of Rene Magritte" \
                  --output-dir=$OUTPUT_DIR \
                  --max-bsz=2 \
                  --sampling-type=fast

Licence and Disclaimer

This project, including the weights, is distributed under the CreativeML Open RAIL-M license, the same license as Stable Diffusion v1. You may use this model in commercial applications, but we strongly recommend adopting a robust safety checker as a post-processing step. We are not responsible for any use of the generated images.

BibTeX

If you find this repository useful in your research, please cite:

@misc{kakaobrain2022karlo-v1-alpha,
  title         = {Karlo-v1.0.alpha on COYO-100M and CC15M},
  author        = {Donghoon Lee and Jiseob Kim and Jisu Choi and Jongmin Kim and Minwoo Byeon and Woonhyuk Baek and Saehoon Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/karlo}},
}

Contact

If you would like to collaborate with us or share feedback, please e-mail us at [email protected]
