Neural-Image-Caption-Generator

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.

This project is maintained by dibyansu24-maker

Neural Image Caption Generator

Abstract

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. A model that bridges these two fields lets techniques from each be applied to the other to attack this fundamental problem. We worked from the paper Show and Tell: A Neural Image Caption Generator (Vinyals et al.) [1], combining a CNN with a recurrent neural network (an LSTM) to build an end-to-end model.

Keywords: NLP, Vision, CNN, RNN, LSTM

Credits:

This problem was addressed as part of our internship at IIT Kanpur. It was a collaborative effort involving the following individuals:

  1. Dibyansu Diptiman Github link
  2. Nikhil Kumawat Github link

Table of Contents

  1. Introduction
  2. Model Architecture
  3. Dataset (Flickr 8k)
  4. Training
  5. Experimental Analysis (Hyperparameter Tuning)
  6. Scope of Further Improvements
  7. References

Introduction

In the area of artificial intelligence, automatically describing the content of an image with properly formed English sentences is a challenging task. Advances in object recognition can be leveraged to drive natural language generation systems, but such approaches are limited in their expressivity. In this paper (Show and Tell: A Neural Image Caption Generator) [1], the authors combine a deep CNN for image classification with an RNN for sequence modelling into a single end-to-end neural network that generates descriptions of images. The model takes an image I as input and produces a target sequence of words S = {S1, S2, ...} by directly maximizing the likelihood p(S|I). A deep CNN acts as an encoder, producing a rich representation of the input image by embedding it into a fixed-length vector. This embedding vector is used as the initial hidden state of a decoder RNN that generates the target sentence. The resulting end-to-end system is trainable with basic stochastic gradient descent or any other flavour of gradient descent. Finally, through experimental results, the authors show that their method performed significantly better than the state-of-the-art approaches of the time (attention-based models have since surpassed it).

Model Architecture

The goal is to directly maximize the probability of the correct description given the image:

θ* = arg max_θ Σ_{(I_i, S_i) ∈ D} log p(S_i | I_i, θ)

Equation explanation:

  1. (I_i, S_i) is an image-caption pair
  2. D is the training dataset
  3. θ is the set of parameters of our model

Applying the chain rule to the joint probability log p(S|I, θ) over the words of S, we get:
log p(S|I, θ) = Σ_{t=0}^{N} log p(S_t | I, θ, S_0, ..., S_{t-1})
Here N is the length of the caption and S_t is the t-th word of S. The authors model this conditional probability with an LSTM (512 units), a special form of recurrent neural network. More specifically, the hidden state h_{t-2} is a non-linear summary of I, θ, S_0, ..., S_{t-2}; given the next word S_{t-1}, the LSTM computes h_{t-1} = f(h_{t-2}, S_{t-1}). Finally, p(S_t | I, θ, S_0, ..., S_{t-1}) is modelled as p_t = Softmax(h_{t-1}), where p_t is a probability distribution over the whole dictionary that determines the word generated at time step t. One point that needs to be specifically addressed is that the authors use the CNN encoding of the image to initialise the LSTM. In our attempt to replicate the paper, we use a pre-trained InceptionV3 network to encode each image into a vector that initialises the LSTM.
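The following is a minimal Keras sketch of this encoder-decoder setup, not our exact training code: a frozen InceptionV3 encodes the image, and the resulting embedding seeds the hidden state of a 512-unit LSTM decoder. The vocabulary size, maximum caption length and embedding dimension below are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumption: size of the caption vocabulary
MAX_LEN    = 34     # assumption: maximum caption length after preprocessing
EMBED_DIM  = 256    # assumption: word-embedding dimension
LSTM_UNITS = 512    # LSTM size described above

# Encoder: pre-trained InceptionV3 without its classification head.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False

image_in   = layers.Input(shape=(299, 299, 3), name="image")
img_vec    = cnn(image_in)                                   # (batch, 2048) image embedding
init_state = layers.Dense(LSTM_UNITS, activation="relu")(img_vec)

# Decoder: previous words in, distribution over the next word out.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption")
word_emb   = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
lstm_out   = layers.LSTM(LSTM_UNITS, return_sequences=True)(
    word_emb, initial_state=[init_state, init_state])        # image embedding initialises the LSTM
p_t        = layers.TimeDistributed(layers.Dense(VOCAB_SIZE, activation="softmax"))(lstm_out)

model = Model([image_in, caption_in], p_t)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

At inference time the decoder is fed its own previous prediction at each step until an end-of-caption token is produced.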

Dataset(Flickr 8k)

Due to limited computation power, we train our model on the Flickr 8k dataset.
The dataset contains roughly 8,000 images (8,092 in the full distribution), each paired with up to five captions. We divided it in the following fashion:
Training set: 6,000 images
Validation/dev set: 1,000 images
Test set: 1,000 images
Link for downloading the dataset: Flickr 8k Dataset Folder
For simplicity we provide a Google Drive link for the dataset; the drive folder already contains the separated training, validation and test sets.
Random dataset: We also assembled our own set of 30-40 random images from the internet to check the model's effectiveness in the real world; an analysis of that set is included in our report as well.
Below is an example from the Flickr 8k dataset:
[Example image with its reference captions]
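As a rough illustration of how the split described above can be read in, here is a minimal sketch. The file names follow the standard Flickr 8k distribution (Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, Flickr_8k.testImages.txt, Flickr8k.token.txt); the local paths are assumptions about the layout of the drive folder linked above.

```python
from collections import defaultdict

def load_split(path):
    """Return the set of image file names listed in a split file (one name per line)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def load_captions(token_path):
    """Map each image file name to its (up to five) reference captions."""
    captions = defaultdict(list)
    with open(token_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_tag, caption = line.split("\t", 1)   # format: "image.jpg#0<TAB>caption"
            captions[image_tag.split("#")[0]].append(caption)
    return captions

train_ids = load_split("Flickr_8k.trainImages.txt")   # 6,000 images
dev_ids   = load_split("Flickr_8k.devImages.txt")     # 1,000 images
test_ids  = load_split("Flickr_8k.testImages.txt")    # 1,000 images
captions  = load_captions("Flickr8k.token.txt")       # captions for the full image set
```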

Training

1. Dataset Preprocessing:
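The cleaning steps are not spelled out here, so the sketch below shows a typical caption-preprocessing pass for this kind of setup: lowercase, strip punctuation, drop non-alphabetic and single-character tokens, and wrap each caption in start/end markers. The startseq/endseq token names and the exact filtering rules are assumptions, not taken from our report.

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, drop non-alphabetic/short tokens, add start/end markers."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha() and len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"   # assumed start/end-of-caption tokens

print(clean_caption("A black dog is running after a white dog in the snow ."))
# startseq black dog is running after white dog in the snow endseq
```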

Experimental Analysis(Hyperparameter Tuning)

We tried many hyperparameter combinations, aiming to reduce both variance (the gap between training loss and validation loss) and bias (high absolute error), and searched for hyperparameters that achieve low variance and low bias.

After trying various epoch counts and batch sizes, we found that smaller batch sizes and fewer epochs reduced overfitting. Since the dataset is small and the objective is to caption real-world images, poor generalization and overfitting were the major issues; the paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima [3] explains why smaller batch sizes help to avoid sharp minima and generalize better. We settled on a batch size of 30/32 (batch sizes that are powers of 2 are preferred), and the optimal number of epochs was found to be around 8-12. The Adam optimizer is used, two learning rates (0.001 and 0.0005) are compared, and categorical cross-entropy is used as the loss. The number of epochs could be increased to reduce training loss further, but our objective was to avoid overfitting so that the model works well on real-world examples. Training loss settled in the 2.6-2.8 range and validation loss in the 2.9-3.3 range, giving a model with low variance and hence little overfitting. In the inference section we also give examples of image captioning over different data and illustrate how this low variance leads to better performance.
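For concreteness, here is a sketch of the compile-and-fit step with these settings. Only the batch size, epoch count, optimizer, learning rates and loss come from the experiments above; the model and data arguments are placeholders (e.g. the encoder-decoder sketch from the architecture section and arrays prepared as described in the dataset section).

```python
import tensorflow as tf

BATCH_SIZE    = 32      # 30 or 32 both worked; powers of 2 are preferred
EPOCHS        = 10      # the optimum we found was roughly 8-12
LEARNING_RATE = 0.001   # 0.0005 was the other learning rate we compared

def train(model, train_inputs, train_targets, dev_inputs, dev_targets):
    """Compile and fit a captioning model with the hyperparameters found above.

    `model` is an encoder-decoder like the sketch in the architecture section;
    inputs are [images, caption_prefixes] and targets are one-hot next-word labels.
    """
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss="categorical_crossentropy",
    )
    return model.fit(
        x=train_inputs,
        y=train_targets,
        validation_data=(dev_inputs, dev_targets),
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
    )
```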

Scope of Further Improvements

References

  1. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
  2. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
  3. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Published as a conference paper at ICLR 2017; arXiv preprint, 15 Sep 2016.
  4. M. Hodosh, P. Young, and J. Hockenmaier. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, volume 47, pages 853–899, 2013. http://www.jair.org/papers/paper3994.html