Discriminability objective for training descriptive captions
This is the implementation of the paper Discriminability Objective for Training Descriptive Captions.
Requirements
Python 2.7 (because there is no coco-caption version for Python 3)
PyTorch 1.0 (along with torchvision)
Java 1.8 (for coco-caption)
Downloads
Clone the repository
git clone --recursive https://github.com/ruotianluo/DiscCaptioning.git
Data split
In this paper we use the data split from Context-aware Captions from Context-agnostic Supervision. It is different from the standard Karpathy split, so we need to download different files.
Download link: Google drive link
To train on your own, you only need to download dataset_coco.json, but it is also suggested to download cocotalk.json and cocotalk_label.h5. If you want to run the pretrained model, you have to download all three files.
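For example, assuming the files were downloaded to the current directory (a sketch; the preprocessing and training steps below expect them under data/):
mkdir -p data
mv dataset_coco.json cocotalk.json cocotalk_label.h5 data/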
coco-caption
cd coco-caption
bash ./get_stanford_models.sh
cd annotations
# Download captions_val2014.json from the google drive link above to this folder
cd ../../
We need to replace captions_val2014.json because the original file can only evaluate images from the val2014 set, while we are using Rama's split.
Pre-computed feature
In this paper, the retrieval model uses the output of the last layer of ResNet-101, and the captioning model uses the bottom-up features from https://arxiv.org/abs/1707.07998.
The features can be downloaded from the same link; decompress them into data/cocotalk_fc and data/cocobu_att respectively.
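For example, assuming the downloaded archives are named cocotalk_fc.tar.gz and cocobu_att.tar.gz (hypothetical names; adjust to whatever the actual downloads are called):
mkdir -p data
# should produce data/cocotalk_fc and data/cocobu_att
tar -xzf cocotalk_fc.tar.gz -C data/
tar -xzf cocobu_att.tar.gz -C data/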
Pretrained models
Download the pretrained models from the link and decompress them into the root folder.
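For example, assuming the download is a single tarball (a hypothetical name; adjust to the actual file):
# extract the checkpoints into the repository root
tar -xzf pretrained_models.tar.gz -C .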
To evaluate the pretrained model, run:
bash eval.sh att_d1 test
The pretrained models can match the results shown in the paper.
Train on your own
Preprocessing
Preprocess the captions (skip if you already have 'cocotalk.json' and 'cocotalk_label.h5'):
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
Preprocess for self-critical training:
$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train
Start training
First train a retrieval model:
bash run_fc_con.sh
Second, pretrain the captioning model:
bash run_att.sh
Third, finetune the captioning model with cider+discriminability optimization:
bash run_att_d.sh 1
(1 is the discriminability weight, and can be changed to other values)
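For example, to train with a smaller discriminability weight:
bash run_att_d.sh 0.5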
Evaluate
bash eval.sh att_d1 test
Citation
If you find this useful, please consider citing:
@InProceedings{Luo_2018_CVPR,
author = {Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
title = {Discriminability Objective for Training Descriptive Captions},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}
Acknowledgements
The code is based on ImageCaptioning.pytorch.