Every Picture Tells a Story: Generating Sentences from Images

  title={Every Picture Tells a Story: Generating Sentences from Images},
  author={Ali Farhadi and Mohsen Hejrati and Mohammad Amin Sadeghi and Peter Young and Cyrus Rashtchian and J. Hockenmaier and David A. Forsyth},
  booktitle={European Conference on Computer Vision},
Humans can prepare concise descriptions of pictures, focusing on what they find important. [] Key Method The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned us-ingdata. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated…

Automatic sentence generation from images

This work proposes a novel approach to generate a sentential caption for the input image by summarizing captions of images that are similar to an input image and demonstrates that the proposed method can generate sentential captions.

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)

This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

A Hierarchical Approach for Generating Descriptive Image Paragraphs

A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.

Understanding images with natural sentences

The proposed system uses general images and captions available on the web as training data to generate captions for new images and has scalability, which makes it useful with large-scale data.

Choosing Linguistics over Vision to Describe Images

This paper addresses the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions, and presents a generic method which benefits from all three sources simultaneously, and is capable of constructing novel descriptions.

Large Scale Retrieval and Generation of Image Descriptions

The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

DISCO: Describing Images Using Scene Contexts and Objects

In this paper, we propose a bottom-up approach to generating short descriptive sentences from images, to enhance scene understanding. We demonstrate automatic methods for mapping the visual

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Grounded Compositional Semantics for Finding and Describing Images with Sentences

The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences, outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.

Composing Simple Image Descriptions using Web-scale N-grams

A simple yet effective approach to automatically compose image descriptions given computer vision based inputs and using web-scale n-grams, which indicates that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image, while permitting creativity in the description -- making for more human-like annotations than previous approaches.

Clustering art

This work extends a recently developed method for learning the semantics of image databases using text and pictures and uses WordNet to provide semantic grouping information and to help disambiguate word senses, as well as emphasize the hierarchical nature of semantic relationships.

Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos

An Integer Programming framework for action recognition and storyline extraction using the storyline model and visual groundings learned from training data is formulated.

Towards total scene understanding: Classification, annotation and segmentation in an automatic framework

A fully automatic learning framework that is able to learn robust scene models from noisy Web data such as images and user tags from Flickr.com that significantly outperforms state-of-the-art algorithms.

Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

A joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus can be used to recognize people and actions in novel images without captions.

I2T: Image Parsing to Text Description

An image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding and uses automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications.

WordsEye: an automatic text-to-scene conversion system

The linguistic analysis and depiction techniques used by WordsEye are described along with some general strategies by which more abstract concepts are made depictable.

Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This work shows how to cluster words that individually are difficult to predict into clusters that can be predicted well, and cannot predict the distinction between train and locomotive using the current set of features, but can predict the underlying concept.

Learning realistic human actions from movies

A new method for video classification that builds upon and extends several recent ideas including local space-time features,space-time pyramids and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

Collecting Image Annotations Using Amazon’s Mechanical Turk

It is found that the use of a qualification test provides the highest improvement of quality, whereas refining the annotations through follow-up tasks works rather poorly.

Improving People Search Using Query Expansions

To improve search results by exploiting faces of other people that co-occur frequently with the queried person, query expansion is used to provide a query-specific relevant set of `negative' examples which should be separated from the potentially positive examples in the text-based result set.