Image-Caption-Generation-using-Deep-Learning-14k-Flickr-Dataset-
In the current scenario of 2020, where quarantine is the buzzword and working from home has become the norm, internet usage is increasing rapidly. At the touch of a screen we can order groceries, medicines, and more. However, not everyone is able to use these services seamlessly. For example, people with impaired vision may find it cumbersome and frustrating to distinguish between blueberries and grapes. This project aims to create a neural network model that can help such users. The complexity and novelty in building such a model is that it must not simply detect an object but also provide useful and accurate information about it.

Hence, this project proposes an ‘Image Caption Generator (using deep learning)’ that processes an image and describes it in a short sentence in a natural language such as English. The model is a combination of two types of neural networks: a CNN (Convolutional Neural Network) for image processing and an LSTM (Long Short-Term Memory), a type of Recurrent Neural Network, for text processing. A subset of 14,000 images, along with their sample captions, was selected from the Flickr_30K dataset.

The generated captions are evaluated using human judgement as well as the BLEU-1 score. Furthermore, the model has been trained and tested with several variations, such as the incorporation of pre-trained GloVe embeddings, different dropout and regularizer rates, and two image feature extraction models: Xception and VGG16. The most relevant and fitting captions were obtained using features from the Xception model with an encoder-decoder architecture, while the highest BLEU-1 scores (above 0.5 on a scale of 0 to 1) were obtained with the VGG16 model using GloVe embeddings.
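The sketch below illustrates the kind of CNN + LSTM encoder-decoder described above, assuming a Keras implementation; the vocabulary size, maximum caption length, feature dimension, and layer sizes are illustrative placeholders rather than the exact values used in this project.

```python
# Minimal sketch of a merge-style encoder-decoder captioning model in Keras.
# Hypothetical values: vocab_size, max_length, feature_dim, and layer sizes.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000    # hypothetical vocabulary size after cleaning captions
max_length = 34      # hypothetical maximum caption length (in tokens)
feature_dim = 2048   # Xception bottleneck feature size (4096 for VGG16 fc layers)

# Image branch: pre-extracted CNN features -> dense projection
image_input = Input(shape=(feature_dim,))
img = Dropout(0.5)(image_input)
img = Dense(256, activation='relu')(img)

# Text branch: partial caption -> embedding (optionally GloVe-initialized) -> LSTM
caption_input = Input(shape=(max_length,))
txt = Embedding(vocab_size, 256)(caption_input)
txt = Dropout(0.5)(txt)
txt = LSTM(256)(txt)

# Decoder: merge both branches and predict the next word of the caption
merged = add([img, txt])
merged = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

For evaluation, BLEU-1 can be computed with NLTK's `corpus_bleu` by setting the n-gram weights to `(1.0, 0, 0, 0)`, which compares only unigram overlap between the generated and reference captions.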