# neural-vqa
This is an experimental Torch implementation of the VIS+LSTM visual question answering model from the paper *Exploring Models and Data for Image Question Answering* by Mengye Ren, Ryan Kiros, and Richard Zemel.
## Setup

Requirements:

- Torch
- loadcaffe
Download the MSCOCO train+val images and VQA data using `sh data/download_data.sh`. Extract all the downloaded zip files inside the `data` folder:
```
unzip Annotations_Train_mscoco.zip
unzip Questions_Train_mscoco.zip
unzip train2014.zip
unzip Annotations_Val_mscoco.zip
unzip Questions_Val_mscoco.zip
unzip val2014.zip
```
If you have already downloaded them, copy the `train2014` and `val2014` image folders and the VQA JSON files into the `data` folder.
Download the VGG-19 Caffe model and prototxt using `sh models/download_models.sh`.
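After these steps, the `data` and `models` folders should look roughly like this (a sketch based on the steps above, not an exhaustive listing):

```
data/
  train2014/    # MSCOCO training images
  val2014/      # MSCOCO validation images
  *.json        # extracted VQA annotation and question files
models/
  VGG_ILSVRC_19_layers_deploy.prototxt
  VGG_ILSVRC_19_layers.caffemodel
```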
## Known issues

- To avoid memory issues with LuaJIT, install Torch with Lua 5.1 (`TORCH_LUA_VERSION=LUA51 ./install.sh`). See the Torch installation documentation for more instructions.
- If working with plain Lua, `luaffifb` may be needed for `loadcaffe`, unless you are using pre-extracted fc7 features.
## Usage

### Extract image features

```
th extract_fc7.lua -split train
th extract_fc7.lua -split val
```
#### Options

- `batch_size`: Batch size. Default is 10.
- `split`: train/val. Default is `train`.
- `gpuid`: 0-indexed ID of the GPU to use. Default is -1 (CPU).
- `proto_file`: Path to the `deploy.prototxt` file for the VGG Caffe model. Default is `models/VGG_ILSVRC_19_layers_deploy.prototxt`.
- `model_file`: Path to the `.caffemodel` file for the VGG Caffe model. Default is `models/VGG_ILSVRC_19_layers.caffemodel`.
- `data_dir`: Data directory. Default is `data`.
- `feat_layer`: Layer to extract features from. Default is `fc7`.
- `input_image_dir`: Image directory. Default is `data`.
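For example, to extract training-set features on the first GPU with a larger batch (the flag values here are illustrative; all flags are documented above):

```
th extract_fc7.lua -split train -gpuid 0 -batch_size 32
```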
### Training

```
th train.lua
```

#### Options
- `rnn_size`: Size of the LSTM internal state. Default is 512.
- `num_layers`: Number of layers in the LSTM.
- `embedding_size`: Size of word embeddings. Default is 512.
- `learning_rate`: Learning rate. Default is 4e-4.
- `learning_rate_decay`: Learning rate decay factor. Default is 0.95.
- `learning_rate_decay_after`: Number of epochs after which to start decaying the learning rate. Default is 15.
- `alpha`: Alpha for Adam. Default is 0.8.
- `beta`: Beta for Adam. Default is 0.999.
- `epsilon`: Denominator term for smoothing. Default is 1e-8.
- `batch_size`: Batch size. Default is 64.
- `max_epochs`: Number of full passes through the training data. Default is 15.
- `dropout`: Dropout for regularization. Probability of dropping input. Default is 0.5.
- `init_from`: Initialize network parameters from a checkpoint at this path.
- `save_every`: Number of iterations after which to checkpoint. Default is 1000.
- `train_fc7_file`: Path to fc7 features of the training set. Default is `data/train_fc7.t7`.
- `fc7_image_id_file`: Path to fc7 image IDs of the training set. Default is `data/train_fc7_image_id.t7`.
- `val_fc7_file`: Path to fc7 features of the validation set. Default is `data/val_fc7.t7`.
- `val_fc7_image_id_file`: Path to fc7 image IDs of the validation set. Default is `data/val_fc7_image_id.t7`.
- `data_dir`: Data directory. Default is `data`.
- `checkpoint_dir`: Checkpoint directory. Default is `checkpoints`.
- `savefile`: Filename to save checkpoints to. Default is `vqa`.
- `gpuid`: 0-indexed ID of the GPU to use. Default is -1 (CPU).
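For example, to train on the first GPU with explicit feature paths (the flag values are illustrative; all flags are documented above):

```
th train.lua -gpuid 0 -batch_size 64 -max_epochs 15 -train_fc7_file data/train_fc7.t7 -val_fc7_file data/val_fc7.t7
```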
### Testing

```
th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'
```
#### Options

- `checkpoint_file`: Path to a model checkpoint to initialize network parameters from.
- `input_image_path`: Path to the input image.
- `question`: Question string.
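Checkpoints are ordinary Torch serialized files, so they can also be inspected from an interactive Torch session. A minimal sketch, reusing the checkpoint path from the example above (the exact contents depend on what `train.lua` saves):

```lua
-- Inspect a saved checkpoint. Deserialization needs the packages whose
-- classes were serialized into it (nn at minimum, possibly others).
require 'torch'
require 'nn'

local checkpoint = torch.load('checkpoints/vqa_epoch23.26_0.4610.t7')
print(checkpoint)
```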
## Sample predictions

Randomly sampled image-question pairs from the VQA test set, with answers predicted by the VIS+LSTM model:

- Q: What animals are those? A: Sheep
- Q: What color is the frisbee that's upside down? A: Red
- Q: What is flying in the sky? A: Kite
- Q: What color is court? A: Blue
- Q: What is in the standing person's hands? A: Bat
- Q: Are they riding horses both the same color? A: No
- Q: What shape is the plate? A: Round
- Q: Is the man wearing socks? A: Yes
- Q: What is over the woman's left shoulder? A: Fork
- Q: Where are the pink flowers? A: On wall
## Implementation Details

- Last hidden layer image features from VGG-19
- Zero-padded question sequences for batched implementation
- Training questions are filtered for `top_n` answers, `top_n = 1000` by default (~87% coverage)
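To make the architecture concrete, here is a minimal Lua sketch of the VIS+LSTM idea, assuming the Element-Research `rnn` package. The dimensions mirror the defaults above, but the variable names and structure are illustrative, not the repo's actual code:

```lua
require 'nn'
require 'rnn'  -- assumed dependency: Element-Research rnn, for nn.LSTM/nn.Sequencer

local vocab_size, embedding_size, rnn_size, top_n = 10000, 512, 512, 1000

-- Project the 4096-d VGG-19 fc7 feature into the word-embedding space,
-- so the image can be fed to the LSTM as if it were the first word.
local image_embed = nn.Linear(4096, embedding_size)

-- Word embeddings for question tokens.
local word_embed = nn.LookupTable(vocab_size, embedding_size)

-- LSTM unrolled over the sequence [image, w1, w2, ...].
local lstm = nn.Sequencer(nn.LSTM(embedding_size, rnn_size))

-- Classify the final hidden state over the top_n answers.
local classify = nn.Sequential()
  :add(nn.SelectTable(-1))          -- hidden state at the last time step
  :add(nn.Linear(rnn_size, top_n))
  :add(nn.LogSoftMax())

-- Forward pass for a single (image, question) pair.
local fc7 = torch.randn(1, 4096)                  -- stand-in for an extracted feature
local question = torch.LongTensor{{3}, {42}, {7}} -- token ids, one per time step
local inputs = { image_embed:forward(fc7) }
for t = 1, question:size(1) do
  -- clone() because a module's output tensor is reused across calls
  table.insert(inputs, word_embed:forward(question[t]):clone())
end
local log_probs = classify:forward(lstm:forward(inputs))
```

The actual model additionally handles batching with zero-padded question sequences and dropout, per the details above.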
## Pretrained model and data files

To reproduce the results shown on this page or to try your own image-question pairs, download the following and run `predict.lua` with the appropriate paths.
- `vqa_epoch23.26_0.4610.t7` (serialized using Lua 5.1; available in GPU and CPU versions)
- `answers_vocab.t7`
- `questions_vocab.t7`
- `data.t7`
## References

- Ren, Mengye, Ryan Kiros, and Richard Zemel. "Exploring Models and Data for Image Question Answering." NIPS 2015.
- Antol, Stanislaw, et al. "VQA: Visual Question Answering." ICCV 2015.