PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)

Fine-grained Image Captioning with CLIP Reward

Code structure

# Configurations
./configs/
    # MLE
    phase1/
    # RL
    phase2/

# COCO caption evaluation
./cider
./coco-caption

# Preprocessing
./clip # CLIP feature extractor
./scripts # COCO preprocessing
./scripts_FineCapEval # FineCapEval preprocessing
./data # Storing preprocessed features

# Core model / Rewards / Data loading
./captioning

# Training / Evaluation
./tools

# Fine-tuning CLIP Text encoder
./retrieval

# Pretrained checkpoints
./save

# Storing original dataset files
./datasets

Setup

Install Dependencies

# Create python environment (optional)
conda create -n clip4caption python=3.7
source activate clip4caption

# Python dependencies
pip install -r requirements.txt

# Install this repo as a package
pip install -e .

# Install Detectron2 (optional for training utilities)
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

# Setup coco-caption (optional for text metric evaluation)
git clone https://github.com/clip-vil/cider
git clone https://github.com/clip-vil/coco-caption

cd coco-caption
bash get_stanford_models.sh
bash get_google_word2vec_model.sh

# Install java (optional for METEOR evaluation as part of text metrics)
sudo apt install default-jre

Download Pretrained models

We host the model checkpoints on Google Drive. Download them and place them as shown below. The .ckpt files for the captioning and CLIP models are 669.65M and 1.12G, respectively.

# Captioning model
./save/
    clipRN50_cider/
        clipRN50_cider-last.ckpt
    clipRN50_cider_clips/
        clipRN50_cider_clips-last.ckpt
    clipRN50_clips/
        clipRN50_clips-last.ckpt
    clipRN50_clips_grammar/
        clipRN50_clips_grammar-last.ckpt
    clipRN50_mle/
        clipRN50_mle-last.ckpt

# Finetuned CLIP Text encoder
./retrieval/
    save/
        clip_negative_text-last.ckpt
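
Before running evaluation, it can help to confirm that the downloaded files ended up where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the path list is simply copied from the listing above.

```python
from pathlib import Path

# Paths copied from the layout listed above, relative to the repository root.
EXPECTED_CHECKPOINTS = [
    "save/clipRN50_cider/clipRN50_cider-last.ckpt",
    "save/clipRN50_cider_clips/clipRN50_cider_clips-last.ckpt",
    "save/clipRN50_clips/clipRN50_clips-last.ckpt",
    "save/clipRN50_clips_grammar/clipRN50_clips_grammar-last.ckpt",
    "save/clipRN50_mle/clipRN50_mle-last.ckpt",
    "retrieval/save/clip_negative_text-last.ckpt",
]


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root.

    Hypothetical helper for illustration; it only checks that the files exist.
    """
    root = Path(repo_root)
    return [p for p in EXPECTED_CHECKPOINTS if not (root / p).is_file()]


if __name__ == "__main__":
    for path in missing_checkpoints():
        print("missing:", path)
```

Run it from the repository root. An empty output means every listed checkpoint was found.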

Dataset preparation

# Original dataset files - to be downloaded
./datasets/
    # Download from http://mscoco.org/dataset/#download
    COCO/
        images/
            train2014/
            val2014/
        annotations/
            captions_train2014.json
            captions_val2014.json

    # Download from https://drive.google.com/drive/folders/1jlwInAsVo-PdBdJlmHKPp34dLnxIIMLx
    FineCapEval/
        images/
            XXX.jpg

MS COCO

  • Download files
./datasets/
    # Download from http://mscoco.org/dataset/#download
    COCO/
        images/
            train2014/
            val2014/
        annotations/
            captions_train2014.json
            captions_val2014.json


./data/
    # Download from http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
    dataset_coco.json

    # Download from https://drive.google.com/drive/folders/1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J
    coco-train-words.p
  • Text processing
python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
  • Visual feature extraction
python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images --model_type RN50

# optional (n_jobs)
--n_jobs 4 --job_id 0
  • Visual feature extraction for CLIP-S Reward
python scripts/clipscore_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images

# optional (n_jobs)
--n_jobs 4 --job_id 0
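
The optional --n_jobs and --job_id flags above suggest that feature extraction can be split across several processes, with each process handling one part of the image list. The sketch below shows one common way such a split could work, a strided partition. This is an assumption for illustration only: the function `shard_for_job` is hypothetical, and the repository's scripts may partition the images differently.

```python
def shard_for_job(items, n_jobs=1, job_id=0):
    """Return the items that job `job_id` (0-based) out of `n_jobs` would handle.

    Assumed strided split, not taken from the repository's scripts.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be >= 1")
    if not 0 <= job_id < n_jobs:
        raise ValueError("job_id must satisfy 0 <= job_id < n_jobs")
    # Job k takes items k, k + n_jobs, k + 2 * n_jobs, ...
    return list(items[job_id::n_jobs])


if __name__ == "__main__":
    images = [f"img_{i}.jpg" for i in range(10)]
    for job_id in range(4):
        print(job_id, shard_for_job(images, n_jobs=4, job_id=job_id))
```

Under this scheme, running the same command four times with --job_id 0 through 3 would cover every image exactly once.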

FineCapEval

./datasets/
    FineCapEval/
        images/
            XXX.jpg

./data/
    FineCapEval.json
    FineCapEval.csv
  • Visual feature extraction
python scripts_FineCapEval/clip_prepro_feats.py --input_json data/FineCapEval.json --output_dir data/FineCapEval --images_root datasets/FineCapEval/images --model_type RN50

# optional (n_jobs)
--n_jobs 4 --job_id 0

Training and Evaluation

1) MLE training

export MLE_ID='clipRN50_mle'

# Training
python tools/train_pl.py --cfg configs/phase1/$MLE_ID.yml --id $MLE_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase1/$MLE_ID.yml --id $MLE_ID

# Text-to-Image Retrieval with CLIP ViT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$MLE_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward mle
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/clipRN50_mle.json

2) RL finetuning

Reward: CIDEr

export REWARD='cider'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Image Retrieval with CLIP ViT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S

export REWARD='clips'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Image Retrieval with CLIP ViT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S + CIDEr

export REWARD='clips_cider'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Image Retrieval with CLIP ViT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json

Reward: CLIP-S + Grammar

  1. Run CLIP Finetuning (for grammar) following ./retrieval/README.md

  2. Run RL training using the updated CLIP

export REWARD='clips_grammar'
export MLE_ID='clipRN50_mle'
export RL_ID='clipRN50_'$REWARD

# Copy MLE checkpoint as starting point of RL finetuning
mkdir save/$RL_ID
cp save/$MLE_ID/$MLE_ID-last.ckpt save/$RL_ID/$RL_ID-last.ckpt

# Training
python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Evaluation
EVALUATE=1 python tools/train_pl.py --cfg configs/phase2/$RL_ID.yml --id $RL_ID

# Text-to-Image Retrieval with CLIP ViT-B/32
python tools/eval_clip_retrieval.py --gen_caption_path "./eval_results/$RL_ID.json"

# Evaluation on FineCapEval
python tools/finecapeval_inference.py --reward $REWARD
python tools/eval_finecapeval.py --generated_id2caption ./FineCapEval_results/$RL_ID.json
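
The four RL runs above differ only in the value of REWARD; the run ID, config path, and checkpoint paths all follow from it. As a rough sketch of that pattern, the snippet below derives the same names and command strings in Python. The helper `rl_plan` is hypothetical and only builds strings that mirror the shell steps above; it does not run training.

```python
REWARDS = ["cider", "clips", "clips_cider", "clips_grammar"]
MLE_ID = "clipRN50_mle"


def rl_plan(reward, mle_id=MLE_ID):
    """Build the paths and commands for one RL finetuning run.

    Hypothetical helper; the strings mirror the shell commands shown above.
    """
    rl_id = "clipRN50_" + reward
    run = f"python tools/train_pl.py --cfg configs/phase2/{rl_id}.yml --id {rl_id}"
    return {
        "rl_id": rl_id,
        # The MLE checkpoint is copied as the starting point of RL finetuning.
        "copy_from": f"save/{mle_id}/{mle_id}-last.ckpt",
        "copy_to": f"save/{rl_id}/{rl_id}-last.ckpt",
        "train": run,
        "evaluate": "EVALUATE=1 " + run,
    }


if __name__ == "__main__":
    for reward in REWARDS:
        plan = rl_plan(reward)
        print(plan["rl_id"])
        print("  cp", plan["copy_from"], plan["copy_to"])
        print("  " + plan["train"])
```

Printing the plan for each reward is a quick way to double-check paths before launching a run.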

Acknowledgments

We thank the developers of CLIP-ViL, ImageCaptioning.pytorch, CLIP, coco-caption, and cider for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@inproceedings{Cho2022CLIPReward,
  title     = {Fine-grained Image Captioning with CLIP Reward},
  author    = {Jaemin Cho and Seunghyun Yoon and Ajinkya Kale and Franck Dernoncourt and Trung Bui and Mohit Bansal},
  booktitle = {Findings of NAACL},
  year      = {2022}
}
