  • Language: Python
  • License: MIT License

[NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT

VisorGPT 🎨 (NeurIPS 2023)

Learning Visual Prior via Generative Pre-Training

Jinheng Xie¹  Kai Ye²  Yudong Li²  Yuexiang Li³  Yefeng Zheng³  Linlin Shen²  Mike Zheng Shou¹

¹ National University of Singapore  ² Shenzhen University  ³ Jarvis Research Center, Tencent YouTu Lab

arXiv demo video webpage

Updates

  • [2023/05/23] Paper is available.
  • [2023/05/28] Gradio demo is available.
  • [2023/05/30] Hugging Face demo is available.
  • [2023/06/13] Training code and data are available.
  • [2023/09/22] VisorGPT is accepted by NeurIPS 2023.

Quick Start

Step 1

# clone the repo
git clone https://github.com/Sierkinhane/VisorGPT.git

# go to directory
cd VisorGPT

# create a new environment
conda create -n visorgpt python=3.8

# activate the new environment
conda activate visorgpt

# prepare the basic environments
pip3 install -r requirements.txt

# install controlnet and gligen
cd demo/ControlNet
pip3 install -v -e .
cd ../GLIGEN
pip3 install -v -e .

Step 2 - Download pre-trained weights

Download the visorgpt, controlnet-pose2img, controlnet-sd, and gligen-bbox2img weights, and place them as follows:

├── demo/
|   ├── ckpts
|   |   ├── controlnet
|   |   |   ├── control_v11p_sd15_openpose.pth
|   |   |   ├── v1-5-pruned-emaonly.safetensors
|   |   ├── gligen
|   |   |   ├── diffusion_pytorch_model_box.bin
|   |   ├── visorgpt
|   |   |   ├── visorgpt_dagger_ta_tb.pt
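
Before launching the demo, it can help to confirm that every weight file sits where the tree above expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the paths are copied from the layout above and assumed to be relative to the repository root.

```python
from pathlib import Path

# Expected files, copied from the directory tree above.
# Assumption: paths are relative to the repository root.
EXPECTED_CKPTS = [
    "demo/ckpts/controlnet/control_v11p_sd15_openpose.pth",
    "demo/ckpts/controlnet/v1-5-pruned-emaonly.safetensors",
    "demo/ckpts/gligen/diffusion_pytorch_model_box.bin",
    "demo/ckpts/visorgpt/visorgpt_dagger_ta_tb.pt",
]


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_CKPTS if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print("  " + path)
    else:
        print("All checkpoints are in place.")
```

Running the script from the repository root prints any file that still needs to be downloaded.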

Step 3 - Run demo

CUDA_VISIBLE_DEVICES=0 python3 gradio_demo.py

Training

  1. Download the preprocessed JSON files from here.
  2. Process them into text corpora, e.g.,
# box type
python3 preprocess_coord.py --input_path path/to/coco_train.json --data_type box --output_dir txt_train
# keypoint type
python3 preprocess_coord.py --input_path path/to/cocokeypoints_train.json --data_type keypoint --output_dir txt_train
# mask type
python3 preprocess_coord.py --input_path path/to/coco_train.json --data_type mask --output_dir txt_train
  3. If you have processed several .txt files, you can merge them into one .txt file, e.g.,
python3 utiles/merge_files.py --file_dir txt_train --output_file_path train.txt
  4. Tokenize the text corpora.
cd train/
python3 preprocess.py --corpus_path ../train.txt \
                      --vocab_path models/google_uncased_en_coord_vocab.txt \
                      --dataset_path train.pt --processes_num 8 \
                      --seq_length 1024 --tgt_seq_length 1024 --data_processor lm
  5. Train the GPT-2-based model. Training requires 8 V100 (32GB) GPUs.
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                    --dataset_path train.pt \
                    --vocab_path models/google_uncased_en_coord_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path train.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 200000 --save_checkpoint_steps 5000 --report_steps 100 \
                    --learning_rate 5e-5 --batch_size 16

Alternatively, download the tokenized data from here (around 340K sequences), put it in the train/ directory, and start training directly:

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                    --dataset_path visorgpt_dagger_train_seq.pt \
                    --vocab_path models/google_uncased_en_coord_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path models/visorgpt_dagger_train_seq.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 200000 --save_checkpoint_steps 10000 --report_steps 100 \
                    --learning_rate 5e-5 --batch_size 16
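
The flags in the training commands imply a rough training budget, which is useful when adapting the recipe to fewer GPUs. The sketch below works through the arithmetic. It assumes that `--batch_size` is the per-GPU batch size and that a checkpoint is written every `--save_checkpoint_steps` steps; neither is stated above. The function `training_budget` is a hypothetical helper, not part of the repository.

```python
def training_budget(total_steps, save_every, batch_size, world_size, num_sequences):
    """Estimate checkpoint count and data passes from the command-line flags."""
    # Assumption: batch_size is per GPU, so one step consumes batch_size * world_size sequences.
    sequences_per_step = batch_size * world_size
    return {
        "checkpoints": total_steps // save_every,
        "sequences_per_step": sequences_per_step,
        "approx_epochs": total_steps * sequences_per_step / num_sequences,
    }


# Values taken from the command that uses the pre-tokenized data (around 340K sequences).
budget = training_budget(
    total_steps=200_000,
    save_every=10_000,
    batch_size=16,
    world_size=8,
    num_sequences=340_000,
)
print(budget)
```

Under these assumptions the run writes 20 checkpoints, processes 128 sequences per step, and makes roughly 75 passes over the data.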

Inference

CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin/200000/mp_rank_00_model_states.pt \
                               --vocab_path models/google_uncased_en_coord_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/gpt2/config.json --seq_length 512
                               
or
CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin \
                               --vocab_path models/google_uncased_en_coord_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/gpt2/config.json --seq_length 512
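
The first command above points `--load_model_path` at a file inside the training output directory, following the pattern output path / step / `mp_rank_00_model_states.pt`. To evaluate an intermediate checkpoint, the path can be built as in the sketch below. It assumes the same pattern holds for every saved step, which is inferred from the single example above. The helper name and its `rank` parameter are hypothetical.

```python
from pathlib import PurePosixPath


def deepspeed_checkpoint_path(output_model_path, step, rank=0):
    """Build the checkpoint file path for a given training step.

    Assumption: checkpoints follow <output_model_path>/<step>/mp_rank_XX_model_states.pt,
    as in the inference command above.
    """
    filename = f"mp_rank_{rank:02d}_model_states.pt"
    return str(PurePosixPath(output_model_path) / str(step) / filename)


print(deepspeed_checkpoint_path("models/visorgpt_dagger_train_seq.bin", 200000))
```

The printed path can then be passed to `--load_model_path`, provided a checkpoint was actually saved at that step.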

Visualization

cd ../
python utils/seq2coord.py --file_path path/to/your/inference/txt --visualize

The visualization results will be saved to the ./debug directory.

If you are using our code, please consider citing our paper.

@inproceedings{xie2023learning,
title={Learning Visual Prior via Generative Pre-Training},
author={Jinheng Xie and Kai Ye and Yudong Li and Yuexiang Li and Kevin Qinghong Lin and Yefeng Zheng and Linlin Shen and Mike Zheng Shou},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
}
