Up-Down-Captioner
Simple yet high-performing image captioning model using Caffe and Python. Using image features from bottom-up attention, in July 2017 this model achieved state-of-the-art performance on all metrics of the COCO captions test leaderboard (SPICE 21.5, CIDEr 117.9, BLEU_4 36.9). The architecture (2-layer LSTM with attention) is described in Section 3.2 of the paper cited in the Reference section below.
Reference
If you use this code in your research, please cite our paper:
```
@inproceedings{Anderson2017up-down,
  author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
  title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  booktitle = {CVPR},
  year = {2018}
}
```
License
This code is released under the MIT License (refer to the LICENSE file for details).
Requirements: software
- Important: Please use the version of Caffe provided as a submodule within this repository. It contains additional layers and features required for captioning.
- Requirements for `Caffe` and `pycaffe` (see: Caffe installation instructions). Note: Caffe must be built with support for Python layers and NCCL!
```
# In your Makefile.config, make sure to have these lines uncommented
WITH_PYTHON_LAYER := 1
USE_NCCL := 1
# Unrelatedly, it's also recommended that you use CUDNN
USE_CUDNN := 1
```
- Nvidia's NCCL library, which is used for multi-GPU training: https://github.com/NVIDIA/nccl
Requirements: hardware
By default, the provided training scripts assume that two GPUs are available, with indices 0 and 1. Training on two GPUs takes around 9 hours. Any NVIDIA GPU with 8GB or more memory should be sufficient. Training scripts and prototxt files will require minor modifications to train on a single GPU (e.g. setting `iter_size` to 2), as sketched below.
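One possible way to make that change (this helper is not part of the repository, and the solver path is an assumption; check `experiments/caption_lstm/` for the actual filename) is to edit the solver through Caffe's protobuf bindings so the effective batch size still matches two GPUs:

```python
# Hedged sketch: bump iter_size in the solver so single-GPU training accumulates
# gradients over two forward passes (matching the effective two-GPU batch size).
# The solver path is an assumption -- adjust to the actual file under experiments/.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

SOLVER_PATH = 'experiments/caption_lstm/solver.prototxt'  # assumed location

solver = caffe_pb2.SolverParameter()
with open(SOLVER_PATH) as f:
    text_format.Merge(f.read(), solver)

solver.iter_size = 2  # accumulate gradients over 2 iterations per weight update

with open(SOLVER_PATH, 'w') as f:
    f.write(text_format.MessageToString(solver))
```

The training script would also need to be pointed at a single GPU index.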
Demo - Using the model to predict on new images
Run installation steps 1-4 below, then use the notebook at `scripts/demo.ipynb`.
Installation
All instructions are run from the top-level directory. To run the demo, only steps 1-4 are required (the remaining steps are for training a model).
1. Clone the Up-Down-Captioner repository:

```
# Make sure to clone with --recursive
git clone --recursive https://github.com/peteanderson80/Up-Down-Captioner.git
```

If you forget to clone with the `--recursive` flag, then you'll need to manually clone the submodules:

```
git submodule update --init --recursive
```
2. Build Caffe and pycaffe:

```
cd ./external/caffe
# If you're experienced with Caffe and have all of the requirements installed
# and your Makefile.config in place, then simply do:
make -j8 && make pycaffe
```
3. Build the COCO tools:

```
cd ./external/coco/PythonAPI
make
```
4. Add the python layers and the Caffe build to PYTHONPATH (a quick sanity check covering steps 2-7 is sketched after this list):

```
cd $REPO_ROOT
export PYTHONPATH=${PYTHONPATH}:$(pwd)/layers:$(pwd)/lib:$(pwd)/external/caffe/python
```
5. Build Ross Girshick's Cython modules (required to run the demo on new images):

```
cd $REPO_ROOT/lib
make
```
6. Download Stanford CoreNLP (required by the evaluation code):

```
cd ./external/coco-caption
./get_stanford_models.sh
```
7. Download the MS COCO train/val image caption annotations. Extract all the json files into one folder `$COCOdata`, then create a symlink to this location:

```
cd $REPO_ROOT/data
ln -s $COCOdata coco
```
8. Pre-process the caption annotations for training (building vocabs etc.):

```
cd $REPO_ROOT
python scripts/preprocess_coco.py
```
9. Download or generate pretrained image features following the instructions below.
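Before moving on, it can be worth verifying that the steps above are wired up correctly. The minimal sketch below is not part of the repository; it assumes the PYTHONPATH export from step 4 is active in the current shell, and that the standard MS COCO annotation filename `captions_train2014.json` is present under `data/coco/` (adjust if your download differs):

```python
# Minimal sanity check for steps 2-7 (run from $REPO_ROOT).
# Assumes PYTHONPATH was exported as in step 4; the annotation filename is the
# standard MS COCO name and may differ for your download.
import caffe                       # pycaffe built in step 2
import rcnn_layers                 # python data layer from ./layers (step 4 path)
from pycocotools.coco import COCO  # COCO tools built in step 3

coco = COCO('data/coco/captions_train2014.json')
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print('Loaded %d captions for image %d' % (len(anns), img_id))
```

If any of these imports fail, re-check the corresponding installation step before attempting training.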
Pretrained image features
LINKS HAVE BEEN UPDATED
The captioner takes pretrained image features as input (and does not finetune). For best performance, bottom-up attention features should be used. Code for generating these features can be found here. For ease of use, we provide pretrained features for the MSCOCO dataset. Manually download the following tsv file and unzip to `data/tsv/` (a sketch for inspecting these files appears at the end of this section):
To make a test server submission, you would also need these features:
Alternatively, to generate conventional pretrained features from the ResNet-101 CNN:
- Download the pretrained ResNet-101 model and save it as `baseline/ResNet-101-model.caffemodel`.
- Download the MS COCO train/val images and extract them into `data/images`.
- Run:

```
cd $REPO_ROOT
./scripts/generate_baseline.py
```
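For the bottom-up attention tsv files mentioned above, each row stores the detected regions and their feature vectors for one image. The following is a hedged sketch of reading them: the field names follow the bottom-up-attention repository's tsv format and the filename is purely illustrative, so verify both against your download.

```python
# Hedged sketch: inspect a bottom-up attention .tsv file from data/tsv/.
# FIELDNAMES follow the bottom-up-attention repo; the filename is illustrative.
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

with open('data/tsv/example_features.tsv') as f:  # replace with a downloaded file
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
    for item in reader:
        num_boxes = int(item['num_boxes'])
        # One box (x1, y1, x2, y2) and one 2048-d feature per detected region
        boxes = np.frombuffer(base64.b64decode(item['boxes']),
                              dtype=np.float32).reshape(num_boxes, 4)
        features = np.frombuffer(base64.b64decode(item['features']),
                                 dtype=np.float32).reshape(num_boxes, -1)
        print(item['image_id'], boxes.shape, features.shape)
        break  # just inspect the first image
```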
Training
To train the model on the karpathy training set, and then generate and evaluate captions on the karpathy testing set (using bottom-up attention features):
```
cd $REPO_ROOT
./experiments/caption_lstm/train.sh
```
Trained snapshots are saved under: snapshots/caption_lstm/
Logging outputs are saved under: logs/caption_lstm/
Generated caption outputs are saved under: outputs/caption_lstm/
Scores for the generated captions (on the karpathy test set) are saved under: scores/caption_lstm/
To train and evaluate the baseline using conventional pretrained features, follow the instructions above but replace `caption_lstm` with `caption_lstm_baseline_resnet`.
Results
Results (using bottom-up attention features) should be similar to the numbers below (as reported in Table 1 of the paper).
|                    | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
| ------------------ | ------ | ------ | ------ | ------- | ----- | ----- |
| Cross-Entropy Loss | 77.2   | 36.2   | 27.0   | 56.4    | 113.5 | 20.3  |
| CIDEr Optimization | 79.8   | 36.3   | 27.7   | 56.9    | 120.1 | 21.4  |
Other useful scripts
- `scripts/create_caption_lstm.py`: The version of Caffe provided as a submodule with this repo includes (amongst other things) a custom `LSTMNode` layer that enables sampling and beam search through LSTM layers. However, the resulting network architecture prototxt files are quite complicated. The file `scripts/create_caption_lstm.py` scaffolds out network structures, such as those in `experiments`.
- `layers/efficient_rcnn_layers.py`: The provided `net.prototxt` file uses a python data layer (`layers/rcnn_layers.py`) that loads all training data (including image features) into memory. If you have insufficient system memory, use this python data layer instead by replacing `module: "rcnn_layers"` with `module: "efficient_rcnn_layers"` in `experiments/caption_lstm/net.prototxt`.
- `scripts/plot.py`: Basic script for plotting validation set scores during training.