Stacked RNNs for Encoder-Decoder Networks: Accurate Machine Understanding of Images

Abstract

We address the image captioning task by combining a convolutional neural network (CNN) with various recurrent neural network architectures. We train the models on over 400,000 training examples (roughly 80,000 images, with 5 captions per image) from the 2014 Microsoft COCO challenge. We demonstrate that a stacked 2-layer RNN achieves better results on the image captioning task than both a vanilla LSTM and a vanilla RNN.
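The encoder-decoder design described above can be sketched as follows: CNN image features seed the decoder's hidden state, and two stacked vanilla RNN layers generate the caption one word at a time. This is a minimal NumPy illustration, not the paper's implementation; the toy dimensions, random weights, and the choice to initialize only the first layer's hidden state from the image features are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H, E, V = 8, 6, 10  # hidden size, embedding size, vocab size (toy values)

# Randomly initialized parameters for the two stacked RNN layers (assumed shapes)
Wxh1, Whh1, b1 = rng.normal(0, 0.1, (H, E)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
Wxh2, Whh2, b2 = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
Why, by = rng.normal(0, 0.1, (V, H)), np.zeros(V)
embed = rng.normal(0, 0.1, (V, E))  # word-embedding table

def rnn_step(x, h, Wxh, Whh, b):
    """One vanilla RNN step: h' = tanh(Wxh @ x + Whh @ h + b)."""
    return np.tanh(Wxh @ x + Whh @ h + b)

def decode(img_feat, max_len=5, start_token=0):
    """Greedy caption decoding.

    img_feat stands in for CNN encoder output (in practice a projected
    high-dimensional feature vector); here it initializes layer 1's hidden state.
    """
    h1, h2 = img_feat.copy(), np.zeros(H)
    token, caption = start_token, []
    for _ in range(max_len):
        h1 = rnn_step(embed[token], h1, Wxh1, Whh1, b1)  # layer 1 reads the current word
        h2 = rnn_step(h1, h2, Wxh2, Whh2, b2)            # layer 2 reads layer 1's state
        token = int(np.argmax(Why @ h2 + by))            # greedy word choice
        caption.append(token)
    return caption

caption = decode(rng.normal(0, 0.1, H))
print(caption)  # a list of max_len vocabulary indices
```

Stacking the second RNN layer on top of the first is what distinguishes this decoder from a single vanilla RNN: layer 2 operates on layer 1's hidden states rather than on word embeddings directly.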

12 Figures and Tables

Cite this paper

@inproceedings{Lambert2016StackedRF,
  title={Stacked RNNs for Encoder-Decoder Networks: Accurate Machine Understanding of Images},
  author={John Lambert},
  year={2016}
}