Show and Tell: A Neural Image Caption Generator

Oriol Vinyals, Alexander Toshev, Samy Bengio, D. Erhan
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively…
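
The training objective above (maximizing the likelihood of the target sentence given the image) amounts to minimizing the summed per-word negative log-likelihood. The sketch below is a minimal illustration with an invented toy decoder; a real system would plug a CNN encoder and an LSTM decoder into the same loop.

```python
import math

# Hypothetical stand-in for a trained decoder: given image features and the
# words generated so far, return a probability distribution over a toy
# vocabulary. (Illustrative only; not the paper's actual model.)
def toy_decoder(image_features, prefix):
    vocab = ["a", "dog", "runs", "<eos>"]
    probs = {w: 0.1 for w in vocab}
    reference = ["a", "dog", "runs", "<eos>"]
    if len(prefix) < len(reference):
        probs[reference[len(prefix)]] = 0.7  # favour the next reference word
    total = sum(probs.values())
    return {w: p / total for w, p in probs.items()}

def caption_nll(image_features, target_sentence, decoder):
    """Negative log-likelihood of the target caption under the decoder:
    -log p(S | I) = -sum_t log p(w_t | I, w_1..w_{t-1})."""
    nll = 0.0
    prefix = []
    for word in target_sentence:
        probs = decoder(image_features, prefix)
        nll -= math.log(probs[word])
        prefix.append(word)
    return nll

loss = caption_nll(None, ["a", "dog", "runs", "<eos>"], toy_decoder)
```

Training would backpropagate this loss through the decoder and (optionally) the image encoder; at test time the same factorization drives sampling or beam search.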


Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.

Fast image captioning using LSTM

A generative automatic image annotation model that uses a deep convolutional neural network to detect image regions, which are later fed to a recurrent neural network trained to maximize the likelihood of the target sentence describing the given image.

Learning to Caption Images with Two-Stream Attention and Sentence Auto-Encoder

A two-stream attention mechanism that can automatically discover latent categories and relate them to image regions based on the previously generated words and a regularization technique that encapsulates the syntactic and semantic structure of captions and improves the optimization of the image captioning model are proposed.

Image Captioning Techniques

A hybrid system is proposed that employs a multilayer Convolutional Neural Network to generate a vocabulary describing images and a Long Short-Term Memory network for accurately structuring meaningful sentences from the generated keywords.

Phrase-based Image Captioning

This paper presents a simple model that is able to generate descriptive sentences given a sample image and proposes a simple language model that can produce relevant descriptions for a given test image using the phrases inferred.

Fine-grained attention for image caption generation

This paper presents a fine-grained attention based model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation that is able to automatically learn to fix its gaze on salient region proposals.

From captions to visual concepts and back

This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.

Image Captioning using Deep Learning

A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is used and trained to maximize the likelihood of the target description sentence given the training image.

A deep learning approach for automatically generating descriptions of images containing people

The main objective of this project is to develop a Deep Learning model that automatically produces descriptions of images containing people, and to determine whether restricting the task to this kind of image is good practice.

Explain Images with Multimodal Recurrent Neural Networks

The m-RNN model directly models the probability distribution of generating a word given previous words and the image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
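
The source-reversal trick described above is a pure preprocessing step: each source sentence is reversed while the target is left unchanged, which shortens the distance between the first source words and the first target words. A minimal sketch (the sentence pair is an invented example):

```python
def reverse_source(pairs):
    """Reverse each source sentence, keep the target unchanged, as in
    the seq2seq training setup described above."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

# Invented English-French pair for illustration.
pairs = [(["the", "cat", "sat"], ["le", "chat", "etait", "assis"])]
prepared = reverse_source(pairs)
```

After this step, "the" sits adjacent to the decoder's first prediction "le", introducing the short-term dependencies the summary mentions.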

Grounded Compositional Semantics for Finding and Describing Images with Sentences

The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences, outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.

Every Picture Tells a Story: Generating Sentences from Images

A system is presented that can compute a score linking an image to a sentence; this score can be used to attach a descriptive sentence to a given image, or to retrieve images that illustrate a given sentence.

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

DeCAF, an open-source implementation of deep convolutional activation features, is released along with all associated network parameters to enable vision researchers to conduct experimentation with deep representations across a range of visual concept learning paradigms.

Neural Machine Translation by Jointly Learning to Align and Translate

It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
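
The (soft-)search amounts to a softmax over per-position relevance scores followed by a weighted average of the source annotations, giving a context vector per target word. The sketch below uses dot-product scores on toy vectors; both are illustrative assumptions (the paper uses a small feed-forward alignment network for the scores).

```python
import math

def soft_attention(query, annotations):
    """Soft attention: softmax over dot-product relevance scores, then a
    weighted average of the annotation vectors as the context vector."""
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(annotations[0])
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations))
               for i in range(dim)]
    return weights, context

# Toy decoder state attending over three source-position annotations.
weights, context = soft_attention([1.0, 0.0],
                                  [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

Because the weights are differentiable, the alignment is learned jointly with translation rather than supplied as a hard segmentation.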

CIDEr: Consensus-based image description evaluation

A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
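
At its core, the metric compares a candidate sentence against a set of reference sentences via weighted n-gram statistics. The simplified sketch below uses plain (unweighted) n-gram counts and cosine similarity for a single n; it conveys the consensus idea but omits CIDEr's TF-IDF weighting and its averaging over n = 1..4.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(c1, c2):
    dot = sum(c1[g] * c2[g] for g in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def consensus_score(candidate, references, n=2):
    """Average cosine similarity between the candidate's n-gram count
    vector and each reference's (unweighted simplification of CIDEr)."""
    cand = Counter(ngrams(candidate, n))
    sims = [cosine(cand, Counter(ngrams(ref, n))) for ref in references]
    return sum(sims) / len(sims)

score = consensus_score(
    ["a", "dog", "runs", "on", "grass"],
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "dog", "is", "running"]])
```

A candidate whose n-grams agree with many references scores high, which is the sense in which the metric rewards consensus.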

Multimodal Neural Language Models

This work introduces two multimodal neural language models: models of natural language that can be conditioned on other modalities, applied to image-text modelling, which can generate sentence descriptions for images without the use of templates, structured prediction, or syntactic trees.

Composing Simple Image Descriptions using Web-scale N-grams

A simple yet effective approach to automatically compose image descriptions given computer vision based inputs and using web-scale n-grams, which indicates that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image, while permitting creativity in the description -- making for more human-like annotations than previous approaches.