Grounding Language Models to Images for Multimodal Inputs and Outputs
This repository hosts the code and model weights for FROMAGe.
Paper | Project Webpage | Demo
Setup instructions
Environment
Set up a new virtualenv, and install required libraries:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Add the fromage library to PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:/home/path/to/fromage/
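To confirm the path took effect, a quick check along these lines may help. This is only a sketch: is_importable is a hypothetical helper name, not part of the repo, and it tests whether Python can locate the package without actually importing it.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("fromage"):
        print("fromage is importable")
    else:
        print("fromage not found; check the PYTHONPATH entry added above")
```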
Pretrained Checkpoints
The FROMAGe model weights (linear layers and [RET] embedding) are small (around 11MB) and are included in this Git repo. They will be in the fromage_model/ folder after cloning. The checkpoint and model config in fromage_model/ reproduce the results reported in our paper.
We have also included a second model trained with a stronger visual linear layer (4 visual tokens instead of 1), located at fromage_model/fromage_vis4. This model generally does better in dialogue settings and requires less tuning of inference-time hyperparameters, as it can better represent more complex images.
Precomputed Embeddings For Image Retrieval
The visual embeddings for Conceptual Captions images with valid URLs are precomputed and stored at this URL. These are used to enable the model to retrieve images. The embeddings take up around 3GB, and are compatible with both model configs we provide. Download the files and place cc3m_embeddings.pkl into the fromage_model/ directory.
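As a rough illustration of how precomputed embeddings can support retrieval, the sketch below ranks a handful of toy image vectors against a query vector. It assumes ranking by cosine similarity and uses a plain dict of made-up 2-dimensional vectors; the actual layout of cc3m_embeddings.pkl and the model's retrieval code are not described in this section and may differ.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the names of the k images whose embeddings are closest to the query."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy data (hypothetical): real embeddings are much higher-dimensional.
toy_embeddings = {
    "cat.png": [1.0, 0.0],
    "mountain.png": [0.0, 1.0],
    "kitten.png": [0.9, 0.1],
}
top = retrieve_top_k([1.0, 0.05], toy_embeddings, k=2)
print(top)
```

Because the image embeddings are fixed, they only need to be computed once, which is why they can be shipped as a single file and shared across both model configs.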
Inference
Check out FROMAGe_example_notebook.ipynb for examples on calling the model for inference. Several of the figures presented in the paper are reproduced in this notebook using greedy decoding of the model. Note that there may be minor differences in image outputs due to CC3M images being lost over time.
Training
Preparing CC3M
Our model is trained on the Conceptual Captions dataset. After following the instructions on the website to download the captions and images, format them into a .tsv file as follows:
caption	image
A picture of a cat	cat.png
Mountains	mountain.png
where each line contains a caption, then a tab, then the filename of the corresponding image file. Save these .tsv files into the dataset/ folder (the default names expected are cc3m_train.tsv and cc3m_val.tsv). The repo contains two placeholder files, and you will have to replace them with the appropriate data.
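One way to produce a file in this layout is sketched below using Python's csv module. write_caption_tsv is a hypothetical helper, not part of the repo. It assumes the header row shown in the example is expected, and it collapses whitespace inside captions so that a stray tab or newline cannot break a row.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple


def write_caption_tsv(pairs: Iterable[Tuple[str, str]], out_path: str) -> int:
    """Write (caption, image filename) pairs as a tab-separated file; return the row count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["caption", "image"])  # header row, as in the example above
        for caption, image in pairs:
            # Collapse tabs/newlines/repeated spaces inside the caption.
            writer.writerow([" ".join(caption.split()), image])
            count += 1
    return count


# Small demonstration written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "cc3m_train.tsv")
rows_written = write_caption_tsv(
    [("A picture of a cat", "cat.png"), ("Mountains", "mountain.png")],
    demo_path,
)
with open(demo_path, encoding="utf-8") as f:
    demo_content = f.read()
print(demo_content)
```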
The corresponding image files should be saved in the data/ directory. The directory can be changed with the --image-dir runtime flag.
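Before starting a long training run, it can be worth checking that every filename listed in the .tsv actually exists in the image directory. The following is a minimal sketch under the assumption that the file has the caption and image columns shown above; find_missing_images is a hypothetical helper and not part of the repo.

```python
import csv
import os
import tempfile
from typing import List


def find_missing_images(tsv_path: str, image_dir: str) -> List[str]:
    """Return image filenames listed in the TSV that are not present in image_dir."""
    missing = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            if not os.path.isfile(os.path.join(image_dir, row["image"])):
                missing.append(row["image"])
    return missing


# Demonstration: one image exists on disk, the other does not.
work_dir = tempfile.mkdtemp()
image_dir = os.path.join(work_dir, "data")
os.makedirs(image_dir)
with open(os.path.join(image_dir, "cat.png"), "wb") as f:
    f.write(b"placeholder bytes, not a real image")

tsv_path = os.path.join(work_dir, "cc3m_val.tsv")
with open(tsv_path, "w", encoding="utf-8") as f:
    f.write("caption\timage\nA picture of a cat\tcat.png\nMountains\tmountain.png\n")

missing = find_missing_images(tsv_path, image_dir)
print(missing)
```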
Training FROMAGe
After preparing CC3M as detailed above, you can start a new training job with the following command:
randport=$(shuf -i8000-9999 -n1) # Generate a random port number
python -u main.py \
--dist-url "tcp://127.0.0.1:${randport}" --dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
--dataset=cc3m --val-dataset=cc3m \
--opt-version='facebook/opt-6.7b' --visual-model='openai/clip-vit-large-patch14' \
--exp_name='fromage_exp' --image-dir='data/' --log-base-dir='runs/' \
--batch-size=180 --val-batch-size=100 --learning-rate=0.0003 --precision='bf16' --print-freq=100
On a single A6000 GPU, the model converges within 24 hours (with a batch size of 180). For GPUs with less memory, you might need to reduce the batch size, enable gradient accumulation, or adjust hyperparameters to get good performance. You may also have to disable NCCL P2P with export NCCL_P2P_DISABLE=1 if you run into issues.
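The arithmetic behind gradient accumulation is simple: gradients from several smaller batches are summed before each optimizer step, so the effective batch size is the per-step batch size times the number of accumulated steps. The sketch below only does this calculation. accumulation_plan is a hypothetical helper, and the name of the repo's flag for enabling accumulation is not given in this section.

```python
import math
from typing import Tuple


def accumulation_plan(target_batch_size: int, per_step_batch_size: int) -> Tuple[int, int]:
    """Return (accumulation steps, resulting effective batch size)."""
    if target_batch_size <= 0 or per_step_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    steps = math.ceil(target_batch_size / per_step_batch_size)
    return steps, steps * per_step_batch_size


# Matching the batch size of 180 used above with a GPU that fits 45 examples per step.
steps, effective = accumulation_plan(180, 45)
print(steps, effective)
```

When the per-step batch size does not divide the target evenly, the effective batch size overshoots slightly (for example, 64 per step gives 3 steps and an effective batch of 192), so the learning rate may still need some adjustment.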
Pruning Model Weights
As FROMAGe only consists of a few pretrained linear layers and the [RET] embedding, we can discard most of the pretrained weights to save on disk space. If you have trained a new model, and wish to do so, you can use fromage/prune_model_ckpt.py to prune the model weights. We used the same script to create the weights in the fromage_model directory.
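Conceptually, pruning a checkpoint amounts to keeping only the entries for the trained parameters and dropping the frozen backbone weights, which can be reloaded from their original sources. The sketch below illustrates the idea on a plain dict. The key names and the prune_state_dict helper are hypothetical, and fromage/prune_model_ckpt.py may work differently.

```python
from typing import Any, Dict, Iterable


def prune_state_dict(state_dict: Dict[str, Any], keep_substrings: Iterable[str]) -> Dict[str, Any]:
    """Keep only entries whose key contains at least one of the given substrings."""
    keep = tuple(keep_substrings)
    return {
        name: value
        for name, value in state_dict.items()
        if any(fragment in name for fragment in keep)
    }


# Toy checkpoint with made-up parameter names; values stand in for tensors.
toy_checkpoint = {
    "frozen_lm.block0.weight": [0.0] * 8,
    "frozen_vision.block0.weight": [0.0] * 8,
    "trained_linear.weight": [0.1, 0.2],
    "ret_embedding.weight": [0.3],
}
pruned = prune_state_dict(toy_checkpoint, ["trained_linear", "ret_embedding"])
print(sorted(pruned))
```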
Unit Tests
You can also test that the code runs locally by running the unit test with pytest -x. This runs a short training and evaluation job, with smaller models, to ensure the code works. The test should complete within approximately 90s.
Note that because of exception catching (to ensure data errors don't terminate training), the test will silently fail and not terminate if there is an I/O error when reading data. Hence, we recommend running the Python command above for debugging data preprocessing.
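To see why swallowed exceptions can hide data problems, and how a separate diagnostic pass can surface them, consider the sketch below. It is not the repo's data loader: scan_for_load_errors and read_bytes are hypothetical helpers that try to read each file once and record every failure instead of silently skipping it.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Tuple


def read_bytes(path: str) -> bytes:
    """Read a file fully; raises OSError subclasses on I/O problems."""
    with open(path, "rb") as f:
        return f.read()


def scan_for_load_errors(
    paths: Iterable[str],
    loader: Callable[[str], object] = read_bytes,
) -> List[Tuple[str, str]]:
    """Try to load every path and return (path, error description) for each failure."""
    failures = []
    for path in paths:
        try:
            loader(path)
        except Exception as exc:  # record the error rather than hiding it
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures


# Demonstration: one readable file and one path that does not exist.
scan_dir = tempfile.mkdtemp()
good_path = os.path.join(scan_dir, "cat.png")
with open(good_path, "wb") as f:
    f.write(b"placeholder")
bad_path = os.path.join(scan_dir, "mountain.png")

failures = scan_for_load_errors([good_path, bad_path])
for path, error in failures:
    print(path, "->", error)
```

If every sample fails to load, a loop that catches the error and simply tries the next sample never makes progress, which matches the hang described above.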
Evaluation
We provide an evaluation script to reproduce our results on contextual image retrieval on Visual Storytelling (results of Table 1 of our paper). The script can be run from evals/eval_vist_retrieval.py. There is also an IPython notebook version (VIST_Contextual_Image_Retrieval.ipynb) in the same directory.
Similarly, we provide scripts to reproduce the text generation and image retrieval results on VisDial (presented in Table 2 of our paper). The script for VisDial text generation can be run from evals/eval_visdial_generation.py (or through the notebook version, VisDial_Inference_IT2T_Generation.ipynb). This reports the NDCG, MRR, and R@k scores for VisDial.
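For readers unfamiliar with these metrics, the sketch below computes MRR and R@k from a list of ranks, using the standard definitions. It assumes each rank is the 1-based position of the ground-truth item among the ranked candidates. The function names are illustrative, and the evaluation scripts may compute these differently. NDCG is left out because it additionally requires graded relevance annotations.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all examples (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of examples whose ground-truth item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy ranks for four examples (made up for illustration).
ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(ranks)
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
print(mrr, r_at_1, r_at_5, r_at_10)
```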
The results for image retrieval can be reproduced by running the evals/eval_visdial_retrieval.py script (or through the notebook version VisDial_Inference_T2I_Retrieval.ipynb), which reports R@k scores.
Gradio Demo
You can launch your own version of the Gradio demo locally by running python demo/app.py, or by duplicating the HuggingFace space.
Check out other unofficial HuggingFace spaces for FROMAGe:
Citation
If you find this work useful, please consider citing:
@inproceedings{koh2023grounding,
title={Grounding Language Models to Images for Multimodal Inputs and Outputs},
author={Koh, Jing Yu and Salakhutdinov, Ruslan and Fried, Daniel},
booktitle={ICML},
year={2023}
}