Show and tell: A neural image caption generator

@inproceedings{Vinyals2015ShowAT,
  title={Show and tell: A neural image caption generator},
  author={Oriol Vinyals and Alexander Toshev and Samy Bengio and Dumitru Erhan},
  booktitle={2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2015},
  pages={3156-3164}
}
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. [...] The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively.
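To make the training objective concrete, here is a minimal sketch of the approach the abstract describes: a CNN image embedding feeds an LSTM decoder, and the model is trained to maximize the log-likelihood of the reference caption, i.e. to minimize per-word cross-entropy. PyTorch and every module and variable name below are assumptions made for illustration, not the authors' code.

import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # project CNN features into the word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim) pooled CNN activations; captions: (B, T) token ids, <start> first
        img = self.img_proj(img_feats).unsqueeze(1)      # image fed once, as the first input step
        words = self.embed(captions[:, :-1])             # teacher forcing on the shifted caption
        states, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(states[:, 1:, :])                # logits predicting captions[:, 1:]

def caption_nll(model, img_feats, captions):
    # maximizing log p(S | I) is equivalent to minimizing per-word cross-entropy
    logits = model(img_feats, captions)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))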
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
TLDR
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.
Fast image captioning using LSTM
TLDR
A generative automatic image annotation model that uses a deep convolutional neural network to detect image regions, which are then fed to a recurrent neural network trained to maximize the likelihood of the target sentence describing the given image.
Learning to Caption Images with Two-Stream Attention and Sentence Auto-Encoder
TLDR
A two-stream attention mechanism that can automatically discover latent categories and relate them to image regions based on the previously generated words and a regularization technique that encapsulates the syntactic and semantic structure of captions and improves the optimization of the image captioning model are proposed.
Image Captioning Techniques
Over the past decade, the need to automatically generate descriptive sentences for images has driven rising interest in natural language processing and computer vision research. Image captioning …
Phrase-based Image Captioning
TLDR
This paper presents a simple model that is able to generate descriptive sentences given a sample image and proposes a simple language model that can produce relevant descriptions for a given test image using the phrases inferred.
Fine-grained attention for image caption generation
TLDR
This paper presents a fine-grained attention based model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation that is able to automatically learn to fix its gaze on salient region proposals.
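A rough sketch of the attention step such a model relies on: at each decoding step the current decoder state scores a set of region-proposal features, and the resulting weighted sum (the "glimpse") conditions the next word prediction. The names, dimensions, and additive scoring form are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Additive (soft) attention over a set of region-proposal features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) region features; hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.w_feat(regions) + self.w_hid(hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(e, dim=1)              # attention weights over the R regions
        context = (alpha * regions).sum(dim=1)       # (B, feat_dim) attended "glimpse"
        return context, alpha.squeeze(-1)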
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
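The multiple instance learning step can be illustrated with the noisy-OR formulation commonly used for such word detectors: an image is a bag of regions, and a word is predicted for the image unless every region independently fails to fire. The NumPy toy below is an illustration of the idea, not the paper's implementation.

import numpy as np

def noisy_or(region_word_probs):
    # region_word_probs: (num_regions, vocab) per-region probabilities that each word applies
    # image-level probability of a word: 1 - prod_j (1 - p_j)
    return 1.0 - np.prod(1.0 - region_word_probs, axis=0)

# toy example: three regions, two words ("dog", "grass")
p = np.array([[0.10, 0.7],
              [0.80, 0.2],
              [0.05, 0.6]])
print(noisy_or(p))   # both words score high because at least one region fires for each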
Image Captioning using Deep Learning
TLDR
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is used and trained to maximize the likelihood of the target description sentence given the training image.
Image Caption Generation with Part of Speech Guidance
TLDR
Experimental results on the most popular benchmark datasets, e.g., Flickr30k and MS COCO, consistently demonstrate that the method can significantly enhance the performance of a standard image caption generation model and achieve competitive results.
From Show to Tell: A Survey on Image Captioning
TLDR
This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.

References

Explain Images with Multimodal Recurrent Neural Networks
TLDR
The m-RNN model directly models the probability distribution of generating a word given previous words and the image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
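The ranking formulation can be made concrete in a few lines: embed the image and every candidate caption in a shared space and sort the pool by similarity to the image. The embedding functions here are placeholders; the paper's specific models and evaluation protocol are not reproduced.

import numpy as np

def rank_captions(image_vec, caption_vecs):
    # image_vec: (d,) image embedding; caption_vecs: (n, d) caption embeddings in the same space
    img = image_vec / np.linalg.norm(image_vec)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = caps @ img                  # cosine similarity of each caption to the image
    return np.argsort(-scores)           # caption indices, best match first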
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
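The "vector space arithmetic" property can be sketched as a nearest-neighbour query in the shared space, e.g. something like an embedding of an image of a blue car minus vec("blue") plus vec("red") retrieving an image of a red car. The function below is a generic illustration under that assumption, not the authors' code.

import numpy as np

def analogy_query(img_vec, minus_word, plus_word, gallery):
    # all vectors live in the shared visual-semantic space; gallery: (n, d) image embeddings
    q = img_vec - minus_word + plus_word
    q = q / np.linalg.norm(q)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int(np.argmax(g @ q))         # index of the closest gallery image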
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
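The source-reversal trick highlighted here is simple to show: only the encoder input is reversed, so the first source words end up closest to the first target words, which creates the short-term dependencies the summary mentions. A schematic example, assuming whitespace-tokenized sentences:

# only the encoder input is reversed; the decoder still emits the target left to right
src = ["the", "cat", "sat", "on", "the", "mat"]
src_reversed = list(reversed(src))   # ["mat", "the", "on", "sat", "cat", "the"]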
Grounded Compositional Semantics for Finding and Describing Images with Sentences
TLDR
The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Every Picture Tells a Story: Generating Sentences from Images
TLDR
A system that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
TLDR
DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, is released to enable vision researchers to experiment with deep representations across a range of visual concept learning paradigms.
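The idea of reusing convolutional activations as generic off-the-shelf features can be sketched with a modern library; a torchvision ResNet stands in here for the original Caffe/AlexNet setup, so treat this as an analogue of the recipe rather than the released DeCAF code.

import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()         # drop the classifier head, keep pooled activations
backbone.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)  # stand-in for a preprocessed image batch
    feats = backbone(images)              # (4, 2048) generic visual features
# these features can then be fed to a linear classifier for a new recognition task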
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
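The soft-search mechanism summarized above is usually written as additive attention: the decoder state s_{i-1} scores every encoder annotation h_j, the scores are normalized with a softmax, and the context vector c_i is the weighted sum,

e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j ,

so each target word draws on its own learned weighting of the source positions rather than a single fixed-length vector.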
CIDEr: Consensus-based image description evaluation
TLDR
A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
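A simplified sketch of how such a consensus metric can be computed: candidate and reference captions are mapped to TF-IDF-weighted n-gram vectors and compared by cosine similarity, averaged over the references and over n-gram lengths. This omits details of the actual metric (stemming, the length penalty of the CIDEr-D variant, exact weighting), so it is only an illustration of the idea.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def consensus_score(candidate, references, doc_freq, num_images, max_n=4):
    # candidate: token list; references: list of token lists for the same image
    # doc_freq[n][gram]: number of images whose reference set contains that n-gram
    total = 0.0
    for n in range(1, max_n + 1):
        def tfidf(counts):
            return {g: c * math.log(num_images / max(1.0, doc_freq[n].get(g, 0)))
                    for g, c in counts.items()}
        cand = tfidf(ngram_counts(candidate, n))
        sims = []
        for ref in references:
            rv = tfidf(ngram_counts(ref, n))
            dot = sum(cand.get(g, 0.0) * w for g, w in rv.items())
            norm = (math.sqrt(sum(v * v for v in cand.values()))
                    * math.sqrt(sum(v * v for v in rv.values())))
            sims.append(dot / norm if norm else 0.0)
        total += sum(sims) / len(sims)
    return total / max_n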
Multimodal Neural Language Models
TLDR
This work introduces two multimodal neural language models: models of natural language that can be conditioned on other modalities, applied to image-text modelling, which can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees.