Image captioning is the task of generating a text description for a given input image. For example, given the image below, we would like to automatically generate a description, such as: "a dog running in the snow".
However, there is no single solution to an image captioning problem, since given the above image, all of the following captions are correct:
But they all accurately describe the image. In essence, we would like a method that describes the image as accurately as a human would. How can we do this? The problem can be reduced to a classification problem at every timestep: given an image and the sequence of past words, the model predicts the next word in the sequence. We then "feed" the predicted word back into the model, predict the next word, and so on. The process is depicted below:
What if the network never generates the <end> token?

This can happen in practice. For example, a model can get stuck repeating the same sequence of words. The most common way to address this is to stop caption generation after a fixed number of steps (e.g. stop if no <end> token has been generated after 30 words).
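To make the generation loop concrete, here is a minimal greedy-decoding sketch (not the exact code of this project): a trained `model`, its `tokenizer` and the pre-extracted image `features` are assumed to exist, and the special `<start>`/`<end>` tokens are assumed to be part of the vocabulary.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, features, max_len=30):
    """Greedy decoding: repeatedly predict the next word and feed it back in."""
    caption = ["<start>"]
    for _ in range(max_len):  # hard cap in case <end> is never produced
        seq = tokenizer.texts_to_sequences([" ".join(caption)])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([features, seq], verbose=0)[0]
        word = tokenizer.index_word[int(np.argmax(probs))]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption[1:])
```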
At this point we assume that the reader is familiar with Recurrent Neural Networks (RNNs); if not, here are some excellent sources of information:
You might have noticed above that we abstracted the method used simply as "Network". There are a number of architectures used in the literature, but they all have one thing in common: they use some sort of RNN. Another thing they have in common is that instead of presenting the "raw" image to the captioning network, they distill each image into a set of image features using a CNN pre-trained on ImageNet, such as VGG. There are two main sub-architectures of captioning networks, depending on the role of the RNN component. Following the naming convention of Tanti et al. (2017, 2018), these sub-architectures are:
We know from the literature that merge architectures are comparable to inject architectures in terms of performance, with one added benefit: they are a bit faster (for the same number of recurrent units). Since the RNN only handles text in a merge architecture, it tends to have a smaller input size and hence fewer parameters. Fewer parameters to train lead to faster training times. Here we will reproduce a subset of the results of Tanti et al. (2017). Specifically, we will focus only on the Flickr8k dataset and will compare the methods using BLEU scores only. While Tanti et al. only use an LSTM (Hochreiter & Schmidhuber, 1997) for the RNN component of the captioning network, we additionally repeat the experiments using a GRU (Cho et al., 2014) instead.
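To make the distinction concrete, here is a minimal Keras-style sketch of a merge model (not the exact code used for these experiments); the vocabulary size, caption length, feature dimension and number of units are placeholder values.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, GRU, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim, units = 5000, 34, 4096, 256  # placeholder sizes

# Image branch: pre-extracted CNN features (e.g. VGG16 fc2 activations).
img_in = Input(shape=(feat_dim,))
img_vec = Dense(units, activation="relu")(img_in)

# Text branch: in a merge model the RNN only ever sees words, never the image.
txt_in = Input(shape=(max_len,))
txt_vec = GRU(units)(Embedding(vocab_size, units, mask_zero=True)(txt_in))

# Merge the two modalities just before predicting the next word.
merged = Dense(units, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")

# An inject model would instead feed img_vec into the RNN together with the
# word embeddings, so the RNN has to process both modalities.
```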
Here are the results:
From now on we focus on our "best" model: the merge architecture with a GRU. "Best" here means comparable performance to the rest, but faster. Now let's sprinkle a bit of dropout after every layer and see how the model reacts:
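Concretely, "dropout after every layer" could look something like the following, reusing the names from the merge sketch above (the 0.5 rate is a placeholder, not the value used in the experiments):

```python
from tensorflow.keras.layers import Dropout

# Same merge sketch as above, with a Dropout layer after each trainable layer.
img_vec = Dropout(0.5)(Dense(units, activation="relu")(img_in))
txt_vec = Dropout(0.5)(GRU(units)(Embedding(vocab_size, units, mask_zero=True)(txt_in)))
merged = Dropout(0.5)(Dense(units, activation="relu")(add([img_vec, txt_vec])))
out = Dense(vocab_size, activation="softmax")(merged)
model = Model(inputs=[img_in, txt_in], outputs=out)
```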
Now we can confirm that the performance of the inject and merge architectures is comparable and that merge architectures are a bit faster. We also saw that using a GRU instead of an LSTM can be a good idea for this specific problem. Now let's have some fun and generate captions for new images (a sketch of how a new image can be run through the pipeline is shown after the examples). I downloaded some images from https://www.pexels.com, where the licence (as of July 2018) states that "The pictures are free for personal and even for commercial use". Here are the captioned images, ranked from best to worst:
The network gets this one right! In general there are many images of dogs in the Flickr8k dataset (I guess people like sharing photos of their dogs on the internet...) so the network usually gets these right, unless the background is too complex.
The network is very good at detecting men, especially in red shirts. It just works in this case. The "in front of the building" part is debatable (since there are only a couple of columns in the background) but not wrong.
In this case the network is correct about a man jumping, and he does have a white shirt. There is, however, a woman in a red dress jumping next to him. There is no rock, but the network has probably learned that when people jump into the water they do so from rocks, which is not always the case.
The network detects the water in the image, but mistakes the standing dog for a man in a red shirt.
Our network seems to hallucinate a man in a red shirt again. It mistakes the toy car for a skateboard, probably by associating the wheels of the car with the wheels of a skateboard.
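As promised above, here is a rough sketch of how a new, downloaded image could be run through the pipeline, assuming a trained `model`, its `tokenizer` and the `generate_caption` helper sketched earlier; the file name is just a hypothetical example.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Pre-trained VGG16 with the classification head removed: the fc2 layer
# gives a 4096-dimensional feature vector per image.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(path):
    img = img_to_array(load_img(path, target_size=(224, 224)))
    img = preprocess_input(np.expand_dims(img, axis=0))
    return vgg.predict(img, verbose=0)

# features = extract_features("dog_in_snow.jpg")       # hypothetical file
# print(generate_caption(model, tokenizer, features))  # helper sketched earlier
```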
Now that we have a feeling for image captioning with neural networks, we move on to other interesting topics such as:
The source code of this project is freely available on GitHub.