Visual Question Answering with Keras

Recent developments in deep learning have paved the way for tasks involving multimodal learning. Visual Question Answering (VQA) is one such challenge: it requires high-level scene interpretation from images combined with language modelling of the related questions and answers. Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. This is a Keras implementation of one such end-to-end system.

Check out the demo here:

Architecture

The learning architecture behind this demo is based on the model proposed in the VQA paper.

The problem is treated as a classification task, in which the top 1000 answers are chosen as classes. Images are transformed by passing them through the VGG-19 model, which produces a 4096-dimensional vector at the second-to-last layer. The tokens in the question are first embedded into 300-dimensional GloVe vectors and then passed through a 2-layer LSTM. Each of the two representations is then passed through a dense layer of 1024 units, and the results are combined using point-wise multiplication. The resulting vector serves as input to a fully connected model with a tanh layer and a final softmax layer.
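The fusion step described above can be sketched as a plain NumPy forward pass. This is a minimal illustration, not the repository's Keras model: the weights are random, the question encoding size (`Q_DIM`), the width of the tanh layer, and the absence of an activation on the 1024-unit projections are all assumptions, and the LSTM itself is replaced by a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 4096    # VGG-19 feature size (from the text)
Q_DIM = 512       # assumed size of the LSTM question encoding
FUSE_DIM = 1024   # dense projection size (from the text)
N_CLASSES = 1000  # top answers used as classes (from the text)


def dense(x, w, b, activation=None):
    """Affine layer followed by an optional activation."""
    y = x @ w + b
    return activation(y) if activation is not None else y


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init(n_in, n_out):
    """Small random weights and zero biases (illustrative only)."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


params = {
    "img": init(IMG_DIM, FUSE_DIM),
    "q": init(Q_DIM, FUSE_DIM),
    "hidden": init(FUSE_DIM, FUSE_DIM),  # tanh layer width is an assumption
    "out": init(FUSE_DIM, N_CLASSES),
}


def predict(img_feat, q_feat):
    """Fuse image and question features and return answer probabilities."""
    img_proj = dense(img_feat, *params["img"])  # (batch, 1024)
    q_proj = dense(q_feat, *params["q"])        # (batch, 1024)
    fused = img_proj * q_proj                   # point-wise multiplication
    hidden = dense(fused, *params["hidden"], activation=np.tanh)
    return softmax(dense(hidden, *params["out"]))  # (batch, 1000)


batch = 2
probs = predict(rng.normal(size=(batch, IMG_DIM)),
                rng.normal(size=(batch, Q_DIM)))
print(probs.shape)
```

Projecting both modalities to the same 1024 dimensions is what makes the element-wise product well defined; the predicted answer for each example is then the class with the highest probability.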

Data

The preprocessed features provided by the VT Vision Lab were used. They consist of images transformed through the VGG-19 model and indexed tokens.

Installation

The following packages need to be installed before running the scripts:

Keras (and the corresponding backend: Theano/TensorFlow)
h5py

Then go to the data folder and download the required files listed there.

Training

Run python train.py along with the following optional parameters: --epoch, --batch_size, --data_limit.

To evaluate the model on the validation set, run python train.py --type val.
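For orientation, the flags mentioned above could be parsed along these lines. This is only a sketch using the standard argparse module, not the actual contents of train.py: the flag names come from this README, while the default values, the "train" choice, and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Command-line parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(description="Train or evaluate the VQA model.")
    # "train" as the default mode is an assumption; the README only names "val".
    parser.add_argument("--type", default="train", choices=["train", "val"],
                        help="run training or validation")
    # All default values below are hypothetical placeholders.
    parser.add_argument("--epoch", type=int, default=10,
                        help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=256,
                        help="examples per gradient step")
    parser.add_argument("--data_limit", type=int, default=None,
                        help="use only the first N examples (None means all)")
    return parser


# Equivalent to: python train.py --type val --batch_size 128
args = build_parser().parse_args(["--type", "val", "--batch_size", "128"])
print(args.type, args.epoch, args.batch_size, args.data_limit)
```

A small --data_limit is a convenient way to smoke-test the pipeline before committing to a full training run.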

Training Details

The preprocessed features are based on these scripts written by the VT Vision Lab team. They already contain the transformed image vectors, indexed tokens for the text, and other metadata for both the training and validation sets.
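To make "indexed tokens" concrete, here is a rough sketch of how questions might be turned into fixed-length integer sequences and how the most frequent answers might be mapped to class indices. It is not the VT Vision Lab preprocessing code: the tokenization, the reserved `PAD` and `UNK` indices, right-padding, and all function names are assumptions made for illustration.

```python
from collections import Counter

PAD = 0  # index reserved for padding (assumption)
UNK = 1  # index reserved for out-of-vocabulary words (assumption)


def tokenize(question):
    """Lower-case, drop a trailing question mark, split on whitespace."""
    return question.lower().rstrip("?").split()


def build_vocab(questions):
    """Map each distinct word to an integer index starting at 2."""
    words = sorted({w for q in questions for w in tokenize(q)})
    return {w: i for i, w in enumerate(words, start=2)}


def encode(question, vocab, max_len):
    """Convert a question to exactly max_len indices, truncating or padding."""
    ids = [vocab.get(w, UNK) for w in tokenize(question)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def top_answers(answers, k):
    """Return {answer: class index} for the k most frequent answers."""
    return {a: i for i, (a, _) in enumerate(Counter(answers).most_common(k))}


questions = ["What color is the cat?", "Is the cat sleeping?"]
vocab = build_vocab(questions)
encoded = encode("What is the dog doing?", vocab, max_len=6)
classes = top_answers(["yes", "no", "yes", "black", "yes", "no"], k=2)
print(encoded)
print(classes)
```

With k set to 1000, `top_answers` corresponds to the answer classes used by the classifier; questions whose answer falls outside that set cannot be predicted correctly by such a model.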

Training was done on a g2.2xlarge spot instance on AWS. Multiple community AMIs are available with all the required packages pre-installed. The g2.2xlarge has an NVIDIA GRID K520 with 4 GB of memory and takes ~277 seconds/epoch for a batch size of 256. The model has been trained for 50 epochs and reaches an accuracy of 45.03% on the validation set. Also, the accuracy started decreasing after 70 epochs. Thus, there is a lot of scope for hyper-parameter tuning here.
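As a rough guide to what a run costs in time, the reported per-epoch figure can be multiplied out. The numbers below are taken from the paragraph above; since the per-epoch time is approximate, so is the total.

```python
SECONDS_PER_EPOCH = 277  # approximate, reported for batch size 256 on a g2.2xlarge
EPOCHS = 50              # number of epochs the model was trained for

total_seconds = SECONDS_PER_EPOCH * EPOCHS
total_hours = total_seconds / 3600
print(f"{total_seconds} s = {total_hours:.2f} h")
```

That comes to a little under four hours of GPU time for the 50-epoch run.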

Running the application

For details on how to run the demo app, check the docs in the app/ folder.

Feedback

If you have any feedback or suggestions, do ping me at [email protected]

anantzoid/VQA-Keras-Visual-Question-Answering
