  • Stars: 2,699
  • Rank: 16,283 (top 0.4%)
  • Language: Jupyter Notebook
  • License: Apache License 2.0
  • Created: over 1 year ago
  • Updated: 15 days ago

Repository Details

Kandinsky 2 — multilingual text2image latent diffusion model

Kandinsky 2.2

Open In Colab — Inference example

Open In Colab — Fine-tuning with LoRA

Description:

Kandinsky 2.2 brings substantial improvements over its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder, CLIP-ViT-G, and support for ControlNet.

The switch to CLIP-ViT-G as the image encoder significantly improves the model's ability to generate aesthetically pleasing images and to understand text, enhancing its overall performance.

The addition of the ControlNet mechanism allows the model to effectively control the image generation process. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.

Architecture details:

  • Text encoder (XLM-Roberta-Large-Vit-L-14) — 560M
  • Diffusion Image Prior — 1B
  • CLIP image encoder (ViT-bigG-14-laion2B-39B-b160k) — 1.8B
  • Latent Diffusion U-Net — 1.22B
  • MoVQ encoder/decoder — 67M
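
Summing the listed components gives a rough overall model size (rough only, since the figures above are rounded):

```python
# Rough total size of Kandinsky 2.2 from the rounded figures above.
M, B = 10**6, 10**9
components = {
    "text_encoder": 560 * M,          # XLM-Roberta-Large-Vit-L-14
    "image_prior": 1000 * M,          # diffusion image prior
    "clip_image_encoder": 1800 * M,   # ViT-bigG-14-laion2B-39B-b160k
    "unet": 1220 * M,                 # latent diffusion U-Net
    "movq": 67 * M,                   # MoVQ encoder/decoder
}
total = sum(components.values())
print(f"~{total / B:.2f}B parameters in total")  # ~4.65B
```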

Checkpoints:

  • Prior: A prior diffusion model mapping text embeddings to image embeddings
  • Text-to-Image / Image-to-Image: A decoding diffusion model mapping image embeddings to images
  • Inpainting: A decoding diffusion model mapping image embeddings and masked images to images
  • ControlNet-depth: A decoding diffusion model mapping image embeddings and an additional depth condition to images

Inference regimes

How to use:

Check our Jupyter notebooks with examples in the ./notebooks folder.

1. text2image

from kandinsky2 import get_kandinsky2
model = get_kandinsky2('cuda', task_type='text2img', model_version='2.2')
images = model.generate_text2img(
    "red cat, 4k photo", 
    decoder_steps=50,
    batch_size=1, 
    h=1024,
    w=768,
)

Kandinsky 2.1

Framework: PyTorch | Hugging Face Space | Open In Colab

Habr post

Demo

pip install "git+https://github.com/ai-forever/Kandinsky-2.git"

Model architecture:

Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion while introducing some new ideas.

As text and image encoders, it uses the CLIP model, with a diffusion image prior mapping between the latent spaces of the CLIP modalities. This approach improves the visual performance of the model and unveils new horizons in blending images and in text-guided image manipulation.

For the diffusion mapping of latent spaces, we use a transformer with num_layers=20, num_heads=32, and hidden_size=2048.
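
These hyperparameters are consistent with the ~1B figure quoted for the Diffusion Image Prior. A back-of-the-envelope estimate (ignoring embeddings, biases, and layer norms, and assuming a standard transformer block with a 4x MLP expansion):

```python
# Per transformer layer: ~4*d^2 weights for the Q, K, V, O projections
# plus ~8*d^2 for an MLP with a 4*d hidden layer, i.e. ~12*d^2 in total.
# Embeddings, biases and layer norms are ignored in this rough estimate.
d, num_layers = 2048, 20
params_per_layer = 12 * d**2
total = num_layers * params_per_layer
print(f"~{total / 1e9:.2f}B")  # ~1.01B, matching the quoted ~1B
```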

Other architecture parts:

  • Text encoder (XLM-Roberta-Large-Vit-L-14) — 560M
  • Diffusion Image Prior — 1B
  • CLIP image encoder (ViT-L/14) — 427M
  • Latent Diffusion U-Net — 1.22B
  • MoVQ encoder/decoder — 67M

Kandinsky 2.1 was trained on the large-scale LAION HighRes image-text dataset and fine-tuned on our internal datasets.

How to use:

Check our Jupyter notebooks with examples in the ./notebooks folder.

1. text2image

from kandinsky2 import get_kandinsky2
model = get_kandinsky2('cuda', task_type='text2img', model_version='2.1', use_flash_attention=False)
images = model.generate_text2img(
    "red cat, 4k photo", 
    num_steps=100,
    batch_size=1, 
    guidance_scale=4,
    h=768, w=768,
    sampler='p_sampler', 
    prior_cf_scale=4,
    prior_steps="5"
)

prompt: "Einstein in space around the logarithm scheme"

2. image fuse

from kandinsky2 import get_kandinsky2
from PIL import Image
model = get_kandinsky2('cuda', task_type='text2img', model_version='2.1', use_flash_attention=False)
images_texts = ['red cat', Image.open('img1.jpg'), Image.open('img2.jpg'), 'a man']
weights = [0.25, 0.25, 0.25, 0.25]
images = model.mix_images(
    images_texts, 
    weights, 
    num_steps=150,
    batch_size=1, 
    guidance_scale=5,
    h=768, w=768,
    sampler='p_sampler', 
    prior_cf_scale=4,
    prior_steps="5"
)
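
The weights list pairs one weight with each text or image condition. As a conceptual sketch only (an assumption about the mechanism, not the repository's internals), a natural reading of mix_images is a convex combination of the per-condition embeddings produced by the image prior; the 768-dim embeddings below are random stand-ins:

```python
import numpy as np

# Four conditions, as in the example above (two texts, two images),
# each reduced to a single embedding vector by the prior (stand-ins here).
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=768) for _ in range(4)]
weights = np.array([0.25, 0.25, 0.25, 0.25])
weights = weights / weights.sum()  # normalize defensively
mixed = sum(w * e for w, e in zip(weights, embeddings))
assert mixed.shape == (768,)
```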

3. inpainting

from kandinsky2 import get_kandinsky2
from PIL import Image
import numpy as np

model = get_kandinsky2('cuda', task_type='inpainting', model_version='2.1', use_flash_attention=False)
init_image = Image.open('img.jpg')
mask = np.ones((768, 768), dtype=np.float32)
mask[:, :550] = 0
images = model.generate_inpainting(
    'man 4k photo', 
    init_image, 
    mask, 
    num_steps=150,
    batch_size=1, 
    guidance_scale=5,
    h=768, w=768,
    sampler='p_sampler', 
    prior_cf_scale=4,
    prior_steps="5"
)
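
The mask built inline above can be wrapped in a small helper (not part of the library): a float32 array the size of the generation, with the first `cols` columns set to 0 and the rest to 1. Which value marks the region to be repainted depends on the pipeline's convention; this simply mirrors the example.

```python
import numpy as np

def column_mask(h, w, cols):
    # Float32 mask matching the generation size; the leftmost `cols`
    # columns are zeroed, mirroring mask[:, :550] = 0 above.
    mask = np.ones((h, w), dtype=np.float32)
    mask[:, :cols] = 0
    return mask

mask = column_mask(768, 768, 550)
assert mask.shape == (768, 768) and mask[:, :550].sum() == 0
```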

Kandinsky 2.0

Framework: PyTorch | Hugging Face Space | Open In Colab

Habr post

Demo

pip install "git+https://github.com/ai-forever/Kandinsky-2.git"

Model architecture:

It is a latent diffusion model with two multilingual text encoders:

  • mCLIP-XLMR — 560M parameters
  • mT5-encoder-small — 146M parameters

These encoders and multilingual training datasets enable a truly multilingual text-to-image generation experience!
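
One way two text encoders can jointly condition a diffusion UNet (an assumption about the mechanism, not this repository's code) is to project both token sequences to a shared width and concatenate them along the sequence axis into a single cross-attention context. All shapes and the projection below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
mclip_tokens = rng.normal(size=(77, 1024))  # mCLIP-XLMR output (stand-in)
mt5_tokens = rng.normal(size=(64, 512))     # mT5-encoder-small output (stand-in)
proj = rng.normal(size=(512, 1024)) * 0.01  # learned projection (stand-in)

# Single cross-attention context for the UNet: both sequences stacked.
context = np.concatenate([mclip_tokens, mt5_tokens @ proj], axis=0)
assert context.shape == (77 + 64, 1024)
```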

Kandinsky 2.0 was trained on a large multilingual dataset of 1B samples, including the samples that we used to train Kandinsky.

In terms of diffusion architecture, Kandinsky 2.0 implements a U-Net with 1.2B parameters.

Kandinsky 2.0 architecture overview:

How to use:

Check our Jupyter notebooks with examples in the ./notebooks folder.

1. text2img

from kandinsky2 import get_kandinsky2

model = get_kandinsky2('cuda', task_type='text2img')
images = model.generate_text2img(
    'A teddy bear на красной площади',
    num_steps=75,
    batch_size=4,
    guidance_scale=10,
    h=512, w=512,
    sampler='ddim_sampler',
    ddim_eta=0.05,
    denoised_type='dynamic_threshold',
    dynamic_threshold_v=99.5,
)

prompt: "A teddy bear на красной площади" ("A teddy bear on Red Square")
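
The denoised_type='dynamic_threshold' option with dynamic_threshold_v=99.5 suggests percentile-based dynamic thresholding in the spirit of Imagen's sampling trick; whether the repository implements it exactly this way is an assumption. The idea: clip the predicted sample to its v-th percentile of absolute values, then rescale back towards [-1, 1]:

```python
import numpy as np

def dynamic_threshold(x, v=99.5):
    # s: the v-th percentile of |x|; clip to [-s, s] and rescale.
    s = np.percentile(np.abs(x), v)
    s = max(s, 1.0)  # never amplify values already within [-1, 1]
    return np.clip(x, -s, s) / s

x = np.array([-3.0, -0.5, 0.0, 0.5, 4.0])
y = dynamic_threshold(x, v=100)  # s = 4: clip to [-4, 4], divide by 4
```

This keeps saturated pixels from pushing the whole sample out of range at high guidance scales.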

2. inpainting

from kandinsky2 import get_kandinsky2
from PIL import Image
import numpy as np

model = get_kandinsky2('cuda', task_type='inpainting')
init_image = Image.open('image.jpg')
mask = np.ones((512, 512), dtype=np.float32)
mask[100:] = 0
images = model.generate_inpainting(
    'Девушка в красном платье',
    init_image,
    mask,
    num_steps=50,
    guidance_scale=10,
    sampler='ddim_sampler',
    ddim_eta=0.05,
    denoised_type='dynamic_threshold',
    dynamic_threshold_v=99.5,
)

prompt: "Девушка в красном платье" ("Girl in a red dress")

3. img2img

from kandinsky2 import get_kandinsky2
from PIL import Image

model = get_kandinsky2('cuda', task_type='img2img')
init_image = Image.open('image.jpg')
images = model.generate_img2img(
    'кошка',  # "cat"
    init_image,
    strength=0.8,
    num_steps=50,
    guidance_scale=10,
    sampler='ddim_sampler',
    ddim_eta=0.05,
    denoised_type='dynamic_threshold',
    dynamic_threshold_v=99.5,
)
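
A common img2img convention (an assumption about this repository's exact implementation): strength controls how much noise is added to the init image before denoising, so only roughly strength * num_steps denoising steps are actually run and the rest are skipped:

```python
def img2img_steps(num_steps, strength):
    # strength=1.0 -> full denoising from pure noise;
    # strength=0.0 -> no steps run, init image returned unchanged.
    executed = int(num_steps * strength)
    return executed, num_steps - executed  # (steps run, steps skipped)

executed, skipped = img2img_steps(50, 0.8)
print(executed, skipped)  # 40 10
```

Lower strength keeps the output closer to the init image; higher strength follows the prompt more freely.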

Authors

More Repositories

1. ru-gpts — Russian GPT-3 models. (Python, 2,045 stars)
2. ru-dalle — Generate images from texts. In Russian. (Jupyter Notebook, 1,638 stars)
3. ghost — A new one-shot face swap approach for image and video domains. (Python, 1,030 stars)
4. ner-bert — BERT-NER (nert-bert) with Google BERT, https://github.com/google-research. (Jupyter Notebook, 403 stars)
5. ru-dolph — RUDOLPH: One Hyper-Tasking Transformer that can be creative like DALL-E and GPT-3 and smart like CLIP. (Jupyter Notebook, 242 stars)
6. Real-ESRGAN — PyTorch implementation of the Real-ESRGAN model. (Python, 201 stars)
7. mgpt — Multilingual Generative Pretrained Model. (Jupyter Notebook, 194 stars)
8. KandinskyVideo — Multilingual end-to-end text2video latent diffusion model. (Python, 140 stars)
9. ru-clip — CLIP implementation for the Russian language. (Jupyter Notebook, 126 stars)
10. ruGPT3_demos — (121 stars)
11. sage — SAGE: Spelling correction, corruption and evaluation for multiple languages. (Jupyter Notebook, 101 stars)
12. deforum-kandinsky — Kandinsky x Deforum: generating short animations. (Python, 100 stars)
13. digital_peter_aij2020 — Materials of the AI Journey 2020 competition dedicated to the recognition of Peter the Great's manuscripts, https://ai-journey.ru/contest/task01. (Jupyter Notebook, 66 stars)
14. music-composer — (Python, 62 stars)
15. ru-prompts — (Python, 54 stars)
16. fusion_brain_aij2021 — Creating multimodal multitask models. (Jupyter Notebook, 47 stars)
17. model-zoo — NLP model zoo for Russian. (44 stars)
18. gigachat — A library for accessing GigaChat. (Python, 43 stars)
19. OCR-model — An easy-to-run OCR model pipeline based on CRNN and CTC loss. (Python, 42 stars)
20. augmentex — Augmentex: a library for augmenting texts with errors. (Python, 40 stars)
21. StackMix-OCR — (Jupyter Notebook, 37 stars)
22. MoVQGAN — MoVQGAN: a model for image encoding and reconstruction. (Jupyter Notebook, 35 stars)
23. MERA — MERA (Multimodal Evaluation for Russian-language Architectures): a new open benchmark for evaluating foundation models for the Russian language. (Jupyter Notebook, 31 stars)
24. tuned-vq-gan — (Jupyter Notebook, 28 stars)
25. ReadingPipeline — A text reading pipeline that combines segmentation and OCR models. (Python, 23 stars)
26. htr_datasets — Our datasets for the HTR (handwritten text recognition) task. (Jupyter Notebook, 23 stars)
27. fbc3_aij2023 — (Jupyter Notebook, 20 stars)
28. mineral-recognition — (Python, 19 stars)
29. DigiTeller — (18 stars)
30. fbc2_aij2022 — FusionBrain Challenge 2.0: creating a multimodal multitask model. (Python, 16 stars)
31. combined_solution_aij2019 — AI Journey 2019: Combined Solution. (Python, 15 stars)
32. railway_infrastructure_detection_aij2021 — AI Journey Contest 2021: AITrain. (Python, 13 stars)
33. no_fire_with_ai_aij2021 — AI Journey Contest 2021: NoFireWithAI. (Jupyter Notebook, 13 stars)
34. SEGM-model — An easy-to-run semantic segmentation model based on Unet. (Python, 11 stars)
35. ControlledNST — An implementation of Neural Style Transfer in PyTorch. (Jupyter Notebook, 8 stars)
36. kandinsky3-diffusers — (Python, 5 stars)
37. mchs-wildfire — A forest fire classification competition. (Jupyter Notebook, 4 stars)
38. no_flood_with_ai_aij2020 — Materials of the AI Journey 2020 competition dedicated to flood forecasting on the Amur River, https://ai-journey.ru/contest/task02. (Jupyter Notebook, 4 stars)
39. paper_persi_chat — PaperPersiChat: Scientific Paper Discussion Chatbot using Transformers and Discourse Flow Management. (Jupyter Notebook, 1 star)
40. Zoom_In_Video_Kandinsky — A framework for creating zoom-in / zoom-out videos based on Kandinsky inpainting. (Jupyter Notebook, 1 star)