An Empirical Study of Language CNN for Image Captioning

@article{Gu2017AnES,
  title={An Empirical Study of Language CNN for Image Captioning},
  author={Jiuxiang Gu and G. Wang and Jianfei Cai and Tsuhan Chen},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1231-1240}
}
  • Jiuxiang Gu, G. Wang, Jianfei Cai, Tsuhan Chen
  • Published 21 December 2016
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
Language models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a language CNN model which is suitable for statistical language modeling tasks and shows competitive performance in image captioning. In contrast to previous models, which predict the next word based on one previous word and a hidden state, our language CNN is fed with all the previous words and can model the long-range dependencies in history words, which are critical…
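As a concrete reading of the abstract's claim, here is a minimal Python/PyTorch sketch of a language CNN that conditions next-word prediction on all previous words via stacked causal convolutions. This is an illustration of the idea only, not the authors' architecture; the class name, layer sizes, and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class LanguageCNN(nn.Module):
    """Toy causal CNN over the full word history (illustrative, not the paper's model)."""
    def __init__(self, vocab_size=10000, emb_dim=256, channels=256,
                 num_layers=3, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        layers, in_ch = [], emb_dim
        for _ in range(num_layers):
            # Left-pad so position t only sees words up to t (causal convolution).
            layers += [nn.ConstantPad1d((kernel_size - 1, 0), 0.0),
                       nn.Conv1d(in_ch, channels, kernel_size),
                       nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.out = nn.Linear(channels, vocab_size)

    def forward(self, history):                    # history: (batch, t) word ids
        x = self.embed(history).transpose(1, 2)    # (batch, emb_dim, t)
        h = self.convs(x)                          # (batch, channels, t)
        return self.out(h[:, :, -1])               # next-word logits from full history

logits = LanguageCNN()(torch.randint(0, 10000, (2, 7)))  # 2 captions, 7 history words
```

Unlike an RNN step, which consumes only the previous word and a hidden state, every prediction here has direct convolutional access to the whole history.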

Citations

Long-Term Recurrent Merge Network Model for Image Captioning
TLDR
A Long-term Recurrent Merge Network (LRMN) model is proposed that merges the image feature at each step via a language model, which not only improves the accuracy of image captioning but also describes the image better.
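A hedged sketch of the "merge the image feature at each step" idea: the image vector is re-injected at every decoding step rather than only at the first one. The module names and sizes below are assumptions, not the LRMN authors' exact design.

```python
import torch
import torch.nn as nn

class StepwiseMergeDecoder(nn.Module):
    """Illustrative decoder that concatenates the image feature at every step."""
    def __init__(self, vocab=10000, emb=256, img=512, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb + img, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, words, img_feat):          # words: (B, T), img_feat: (B, img)
        w = self.embed(words)                    # (B, T, emb)
        v = img_feat.unsqueeze(1).expand(-1, w.size(1), -1)  # repeat image per step
        h, _ = self.lstm(torch.cat([w, v], dim=-1))
        return self.out(h)                       # (B, T, vocab) next-word logits
```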
A Survey on Image Encoders and Language Models for Image Captioning
TLDR
The image encoders and language models used by the state-of-the-art image captioning models are discussed, and the convolutional neural networks (CNNs) they apply are examined.
A Survey on Image Captioning datasets and Evaluation Metrics
TLDR
Various datasets and evaluation metrics which are useful for the image captioning task are discussed, and the datasets and evaluation metrics applied by the state-of-the-art image captioning models are summarized.
Image Captioning Using R-CNN & LSTM Deep Learning Model
TLDR
A Fully Convolutional Localization Network is proposed that processes a picture with a single forward pass and can be trained in a single round of optimization.
Attention-Based Deep Learning Model for Image Captioning: A Comparative Study
TLDR
This paper proposes a comparative study of attention-based deep learning models for image captioning, and presents basic techniques for analyzing their performance, advantages, and weaknesses.
Survey of convolutional neural networks for image captioning
TLDR
A survey is provided of models that use a CNN for image embedding and an RNN for language modeling and prediction, along with the advantages that make this area worth exploring.
A Multi-task Learning Approach for Image Captioning
TLDR
The experimental results demonstrate that the proposed Multi-task Learning Approach for Image Captioning achieves impressive results compared to other strong competitors.
A Hybridized Deep Learning Method for Bengali Image Captioning
TLDR
A standard strategy is proffered for Bengali image caption generation on two different sizes of the Flickr8k dataset and on the BanglaLekha dataset, which is the only publicly available Bengali dataset for image captioning.
Recurrent Fusion Network for Image Captioning
TLDR
This paper proposes a novel recurrent fusion network (RFNet) for the image captioning task, which can exploit the interactions among the outputs of the image encoders and generate new compact and informative representations for the decoder.
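For intuition only, a much-simplified sketch of fusing several image encoders' outputs into one compact representation for the decoder. RFNet's actual recurrent fusion procedure is more involved; the projection sizes and mean-pooling choice here are assumptions.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Project heterogeneous encoder outputs to one space and pool them (illustrative)."""
    def __init__(self, dims=(2048, 1536, 1024), out=512):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, out) for d in dims)

    def forward(self, feats):                    # feats: list of (B, d_i) encoder outputs
        stacked = torch.stack([p(f) for p, f in zip(self.projs, feats)], dim=1)
        return stacked.mean(dim=1)               # (B, out) fused representation
```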
...

References

SHOWING 1-10 OF 62 REFERENCES
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
TLDR
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
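The quantity being modeled is P(w_t | w_1, …, w_{t-1}, image). Below is a simplified sketch of such a multimodal layer that sums projections of the word embedding, the recurrent state, and the image feature; the sizes are illustrative, and this condenses rather than reproduces the m-RNN paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    """Toy m-RNN-style model: next-word logits conditioned on history and image."""
    def __init__(self, vocab=10000, emb=256, hid=256, img=512, mm=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb, hid, batch_first=True)
        self.w_proj = nn.Linear(emb, mm)
        self.h_proj = nn.Linear(hid, mm)
        self.v_proj = nn.Linear(img, mm)
        self.out = nn.Linear(mm, vocab)

    def forward(self, words, img_feat):              # words: (B, T), img_feat: (B, img)
        w = self.embed(words)                        # (B, T, emb)
        h, _ = self.rnn(w)                           # (B, T, hid) recurrent history
        v = self.v_proj(img_feat).unsqueeze(1)       # (B, 1, mm), broadcast over steps
        m = torch.tanh(self.w_proj(w) + self.h_proj(h) + v)
        return self.out(m)                           # logits at step t predict word t+1
```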
genCNN: A Convolutional Architecture for Word Sequence Prediction
TLDR
It is argued that the proposed novel convolutional architecture, named genCNN, can give adequate representation of the history, and therefore can naturally exploit both short- and long-range dependencies.
Phrase-based Image Captioning
TLDR
This paper presents a simple model that is able to generate descriptive sentences given a sample image and proposes a simple language model that can produce relevant descriptions for a given test image using the phrases inferred.
A Convolutional Neural Network for Modelling Sentences
TLDR
A convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) is described that is adopted for the semantic modelling of sentences and induces a feature graph over the sentence that is capable of explicitly capturing short and long-range relations.
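The DCNN's characteristic operation is k-max pooling: keep the k largest activations per feature map while preserving their original order, so features from anywhere in a variable-length sentence survive. A minimal sketch follows (the function name is ours); in the full model, k itself is computed dynamically from the sentence length.

```python
import torch

def kmax_pooling(x, k, dim=-1):
    # x: (batch, channels, length). Take the indices of the top-k values,
    # then re-sort the indices so pooled features keep their sequence order.
    idx = x.topk(k, dim=dim).indices.sort(dim=dim).values
    return x.gather(dim, idx)

x = torch.randn(2, 4, 9)          # 2 sentences, 4 feature maps, length 9
pooled = kmax_pooling(x, k=3)     # (2, 4, 3)
```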
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
TLDR
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.
Guiding the Long-Short Term Memory Model for Image Caption Generation
In this work we focus on the problem of image caption generation. We propose an extension of the long short term memory (LSTM) model, which we coin gLSTM for short. In particular, we add semantic information extracted from the image as extra input to each unit of the LSTM block.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TLDR
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
...