  • Stars: 840
  • Rank 54,265 (Top 2 %)
  • Language: Python
  • License: Apache License 2.0
  • Created about 1 year ago
  • Updated 8 months ago

Repository Details

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng* , Xuehai He* , Xin Eric Wang

University of California, Santa Cruz

Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", acting as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-staged training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.
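
For readers unfamiliar with classifier-free guidance, the sketch below shows the widely used generic formulation, in which an unconditional and a conditional prediction are blended with a guidance scale. It is not taken from the MiniGPT-5 code: the function name, the plain-list representation, and the parameter names are assumptions made for illustration. In this project the condition would presumably be derived from the vokens, as the abstract suggests.

```python
from typing import List, Sequence


def classifier_free_guidance(
    uncond: Sequence[float],
    cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and conditional predictions element-wise.

    Generic formulation: uncond + guidance_scale * (cond - uncond).
    A scale of 1.0 returns the conditional prediction unchanged;
    larger values push the result further toward the condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]
```

A scale of 0.0 ignores the condition entirely, which is a quick way to sanity-check an implementation.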

Model Architecture

[Figure: MiniGPT-5 model architecture]

Getting Started

Installation

1. Download repo and create environment

Clone our repo and create a new Python environment.

git clone https://github.com/eric-ai-lab/MiniGPT-5.git
cd MiniGPT-5
conda create -n minigpt5 python=3.9
conda activate minigpt5
pip install -r requirements.txt

2. Prepare the pretrained weights

Our model is based on the pretrained MiniGPT-4 (including Vicuna and BLIP-2). Please download the Vicuna V0 7B weights, then set the path to the Vicuna weights in the model config file at Line 16.

Since the pretrained MiniGPT-4 aligned checkpoint is small, we have already included it in the config folder, and its path is set in the config file at Line 10.

3. Download MiniGPT-5 Checkpoint

Since our model is trained in two stages (Stage 1: Unimodal Alignment Stage, Stage 2: Multimodal Learning Stage), we provide checkpoints for both stages here:

Stage 1: CC3M | Stage 2: VIST | Stage 2: MMDialog
Download      | Download      | Download

Stage 2 needs the pretrained weights from Stage 1, so always download the Stage 1 weights first.

Please download these weights into a single folder. We refer to this folder as WEIGHT_FOLDER in the following sections.

Demo

We provide a Python file for trying our model. It takes two-turn multimodal inputs and generates multimodal outputs under the examples folder.

cd examples
export IS_STAGE2=True
python3 playground.py --stage1_weight WEIGHT_FOLDER/stage1_cc3m.ckpt \
                        --test_weight WEIGHT_FOLDER/stage2_vist.ckpt
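
The same demo can be launched from Python, which avoids shell line-continuation issues. This is a sketch only: it reuses the script name, flags, and environment variable shown above, while the helper names `demo_invocation` and `run_demo` are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def demo_invocation(weight_folder: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the argument list and environment for the demo command."""
    env = dict(os.environ, IS_STAGE2="True")  # same as `export IS_STAGE2=True`
    cmd = [
        "python3", "playground.py",
        "--stage1_weight", f"{weight_folder}/stage1_cc3m.ckpt",
        "--test_weight", f"{weight_folder}/stage2_vist.ckpt",
    ]
    return cmd, env


def run_demo(weight_folder: str, examples_dir: str = "examples") -> None:
    """Run the demo from the examples folder (equivalent to `cd examples`)."""
    # Resolve the folder first, because the working directory changes below.
    cmd, env = demo_invocation(os.path.abspath(weight_folder))
    subprocess.run(cmd, cwd=examples_dir, env=env, check=True)
```

Calling `run_demo("WEIGHT_FOLDER")` from the repository root would then mirror the shell commands above.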

Evaluation

Our model is evaluated on three datasets: CC3M, VIST, and MMDialog. Due to licensing, we only share a few dataset examples under the datasets folder. To fully test performance, please download the full datasets and format them with the same data structure as the examples under the datasets folder.

1. Stage 1: Unimodal Alignment Stage (CC3M) evaluation

During this stage, the goal is to generate correct images from the given image descriptions.

Generation (if you have more than one GPU, you can set gpus to 0,1,2...):

export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path cc3m_val.tsv \
                        --test_weight stage1_cc3m.ckpt \
                        --gpus 0

Calculate Metric:

export CC3M_FOLDER=datasets/CC3M
python3 metric.py --test_weight stage1_cc3m.ckpt

2. Stage 2: Multimodal Learning Stage (VIST) evaluation

The model takes the preceding multimodal story sequence and generates either unimodal or multimodal outputs. By default, the code tests multimodal input with image generation. To test other settings, please remove the not test condition in Line 280.

Generation:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path val_cleaned.json \
                        --test_weight stage2_vist.ckpt \
                        --stage1_weight stage1_cc3m.ckpt \
                        --gpus 0

Calculate Metric:

python3 metric.py --test_weight stage2_vist.ckpt

3. Stage 2: Multimodal Learning Stage (MMDialog) evaluation

The model takes the multimodal inputs from previous turns and generates a multimodal response for the conversation.

Generation:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/MMDialog
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path test/test_conversations.txt \
                        --test_weight stage2_mmdialog.ckpt \
                        --stage1_weight stage1_cc3m.ckpt \
                        --gpus 0

Calculate Metric:

python3 metric.py --test_weight stage2_mmdialog.ckpt
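
The three evaluation setups differ only in a handful of values, so they can be summarized in one table-like structure. The sketch below collects the environment variables and flags from the commands above; the names `EvalSetting`, `EVAL_SETTINGS`, `eval_env`, and `eval_command` are hypothetical and do not exist in the repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class EvalSetting:
    is_stage2: bool
    data_folder: str
    test_data_path: str
    test_weight: str
    stage1_weight: Optional[str] = None  # only passed for Stage 2 runs


# Values copied from the evaluation commands in this README.
EVAL_SETTINGS = {
    "cc3m": EvalSetting(False, "datasets/CC3M", "cc3m_val.tsv", "stage1_cc3m.ckpt"),
    "vist": EvalSetting(True, "datasets/VIST", "val_cleaned.json",
                        "stage2_vist.ckpt", "stage1_cc3m.ckpt"),
    "mmdialog": EvalSetting(True, "datasets/MMDialog", "test/test_conversations.txt",
                            "stage2_mmdialog.ckpt", "stage1_cc3m.ckpt"),
}


def eval_env(name: str, weight_folder: str, output_folder: str = "outputs") -> Dict[str, str]:
    """Environment variables exported before running train_eval.py."""
    setting = EVAL_SETTINGS[name]
    return {
        "IS_STAGE2": str(setting.is_stage2),
        "WEIGHTFOLDER": weight_folder,
        "DATAFOLDER": setting.data_folder,
        "OUTPUT_FOLDER": output_folder,
    }


def eval_command(name: str, gpus: Sequence[int] = (0,)) -> List[str]:
    """Argument list for the generation step of the chosen evaluation."""
    setting = EVAL_SETTINGS[name]
    cmd = ["python3", "train_eval.py",
           "--test_data_path", setting.test_data_path,
           "--test_weight", setting.test_weight]
    if setting.stage1_weight is not None:
        cmd += ["--stage1_weight", setting.stage1_weight]
    cmd += ["--gpus", ",".join(str(g) for g in gpus)]
    return cmd
```

For example, `eval_command("vist", gpus=(0, 1))` ends with `--gpus 0,1`, matching the multi-GPU form mentioned for Stage 1.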

Training

1. Stage 1 training

Download the CC3M dataset and format it with the same data structure as in the datasets folder.

Here we use the test data as an example:

export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
python3 train_eval.py --is_training True \
                        --train_data_path cc3m_val.tsv \
                        --val_data_path cc3m_val.tsv \
                        --model_save_name stage1_cc3m_{epoch}-{step} \
                        --gpus 0

2. Stage 2 training

Download the VIST or MMDialog dataset and format it with the same data structure as in the datasets folder.

Here we use the VIST test data as an example:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
python3 train_eval.py --is_training True \
                        --train_data_path val_cleaned.json \
                        --val_data_path val_cleaned.json \
                        --stage1_weight stage1_cc3m.ckpt \
                        --model_save_name stage2_vist_{epoch}-{step} \
                        --gpus 0
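
When launching training from a script, passing the arguments as a list keeps the `{epoch}-{step}` placeholder in the save name literal, with no shell quoting to worry about. The sketch below assembles the flags shown in the two training commands above; the function name `training_command` is hypothetical.

```python
from typing import List, Optional, Sequence


def training_command(
    train_data_path: str,
    val_data_path: str,
    model_save_name: str,
    stage1_weight: Optional[str] = None,
    gpus: Sequence[int] = (0,),
) -> List[str]:
    """Build the train_eval.py argument list for Stage 1 or Stage 2 training.

    Pass `stage1_weight` only for Stage 2, which starts from Stage 1 weights.
    """
    cmd = [
        "python3", "train_eval.py",
        "--is_training", "True",
        "--train_data_path", train_data_path,
        "--val_data_path", val_data_path,
    ]
    if stage1_weight is not None:
        cmd += ["--stage1_weight", stage1_weight]
    cmd += ["--model_save_name", model_save_name,
            "--gpus", ",".join(str(g) for g in gpus)]
    return cmd
```

The IS_STAGE2, WEIGHTFOLDER, and DATAFOLDER environment variables still need to be set as in the shell examples before the command is run.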

If you find MiniGPT-5 useful in your research or applications, please cite as below:

@misc{zheng2023minigpt5,
      title={MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens}, 
      author={Kaizhi Zheng and Xuehai He and Xin Eric Wang},
      year={2023},
      journal={arXiv preprint arXiv:2310.02239}
}
