Deep Visual-Semantic Alignments for Generating Image Descriptions

Authors: Andrej Karpathy and Li Fei-Fei
Venue: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNN) over sentences, and a structured objective that aligns the two modalities through a multimodal…

