Grounded Compositional Semantics for Finding and Describing Images with Sentences

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng
Transactions of the Association for Computational Linguistics
Published 30 April 2014 · Computer Science
Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use… 
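The DT-RNN's retrieval setup can be illustrated with a toy sketch: a sentence is composed into a single vector over its (dependency) structure, images are represented as feature vectors in the same joint space, and retrieval ranks images by inner product with the sentence embedding. Everything below (dimensions, the identity composition weights, the random features) is a placeholder, not the paper's learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # hypothetical dimensionality of the joint multimodal space

def compose_dependency_tree(word_vectors):
    """Toy stand-in for DT-RNN composition: the parent vector is a
    nonlinearity applied to a weighted average of the word vectors.
    The real model learns dependency-specific weight matrices."""
    W = np.eye(DIM)  # identity in place of learned composition weights
    avg = word_vectors.mean(axis=0)
    return np.tanh(W @ avg)

# Toy "sentence": three random word vectors composed into one embedding.
words = rng.normal(size=(3, DIM))
sentence_vec = compose_dependency_tree(words)

# Toy "images": four random feature vectors already mapped into the joint space.
image_feats = rng.normal(size=(4, DIM))

# Retrieval: rank candidate images by inner product with the sentence vector.
scores = image_feats @ sentence_vec
best = int(np.argmax(scores))
```

In the actual model both mappings are trained so that described image-sentence pairs score higher than mismatched ones; the sketch only shows the scoring step.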

Skip-Thought Vectors

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…

Learning Meaningful Sentence Embedding Based on Recursive Auto-encoders

This work proposes a novel method to learn meaning representations for variable-sized sentences based on recursive auto-encoders, in an unsupervised manner and without using any parse or dependency tree.

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data and introduces a structured max-margin objective that allows this model to explicitly associate fragments across modalities.
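The structured max-margin idea can be sketched in a few lines: the score of an aligned image-sentence pair should exceed the score of every mismatched pair by at least a margin, and only violations contribute to the loss. This is a generic ranking-loss sketch, not the paper's fragment-level objective; the margin value and scores are illustrative.

```python
import numpy as np

def max_margin_loss(scores, correct_idx, margin=1.0):
    """Hedged sketch of a max-margin ranking objective: penalize every
    mismatched candidate whose score comes within `margin` of (or
    exceeds) the score of the correct pairing."""
    s_pos = scores[correct_idx]
    losses = np.maximum(0.0, margin - s_pos + scores)
    losses[correct_idx] = 0.0  # the correct pair incurs no penalty against itself
    return losses.sum()

# Correct image scores 2.0; one distractor (1.5) violates the margin by 0.5.
scores = np.array([2.0, 0.5, 1.5])
loss = max_margin_loss(scores, correct_idx=0)  # → 0.5
```

When the correct pair outscores every distractor by the full margin, the loss is zero, which is what lets such objectives focus training on hard negatives.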

Phrase-based Image Captioning

This paper presents a simple model that is able to generate descriptive sentences given a sample image and proposes a simple language model that can produce relevant descriptions for a given test image using the phrases inferred.

Combining Language and Vision with a Multimodal Skip-gram Model

Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.

Learning to Represent Image and Text with Denotation Graphs

It is shown that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations, which lead to stronger empirical results on downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition.

Associating Images with Sentences Using Recurrent Canonical Correlation Analysis

This model includes a contextual attention-based LSTM-RNN that selectively attends to salient regions of an image at each time step and represents all salient content within a few steps, focusing on modelling a contextual visual attention mechanism for the task of association analysis.

A Convolutional Neural Network for Modelling Sentences

A convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) is described that is adopted for the semantic modelling of sentences and induces a feature graph over the sentence that is capable of explicitly capturing short and long-range relations.

Expressing an Image Stream with a Sequence of Natural Sentences

An approach for retrieving a sequence of natural sentences for an image stream that learns directly from the vast user-generated resource of blog posts, used as text-image parallel training data, and outperforms other state-of-the-art candidate methods.

Learning to Compose Spatial Relations with Grounded Neural Language Models

This paper examines how language models conditioned on perceptual features can capture the semantics of composed phrases as well as of individual words by evaluating how the learned language models can deal with novel grounded composed descriptions and novel grounded decomposed descriptions.



Semantic Compositionality through Recursive Matrix-Vector Spaces

A recursive neural network model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length and can learn the meaning of operators in propositional logic and natural language is introduced.
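The matrix-vector idea behind this model can be shown concretely: each word carries both a vector (its meaning) and a matrix (how it modifies neighbours), and a parent phrase gets a vector p = g(W[Ba; Ab]) and a matrix P = W_M[A; B]. The sketch below uses random placeholders for what are learned parameters in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4  # toy dimensionality

# Each word has a vector (a, b) and a matrix (A, B); values here are
# random stand-ins for learned parameters.
a, b = rng.normal(size=D), rng.normal(size=D)
A, B = rng.normal(size=(D, D)), rng.normal(size=(D, D))
W  = rng.normal(size=(D, 2 * D))   # vector-composition weights
WM = rng.normal(size=(D, 2 * D))   # matrix-composition weights

# Parent vector: each word's matrix first transforms the other word's
# vector, then the pair is composed through W and a nonlinearity.
p = np.tanh(W @ np.concatenate([B @ a, A @ b]))

# Parent matrix: a linear map of the two child matrices stacked.
P = WM @ np.vstack([A, B])
```

Because the parent also receives a matrix, composition can recurse up a parse tree with every node in the same (vector, matrix) form, which is what lets the model treat operator-like words (e.g. "not", "very") differently from content words.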

Parsing Natural Scenes and Natural Language with Recursive Neural Networks

A max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences is introduced.

Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks

A recursive neural network architecture for jointly parsing natural language and learning vector space representations for variable-sized inputs and captures semantic information: For instance, the phrases “decline to comment” and “would not disclose the terms” are close by in the induced embedding space.

Every Picture Tells a Story: Generating Sentences from Images

A system that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.

A unified architecture for natural language processing: deep neural networks with multitask learning

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic…

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

This work introduces a method for paraphrase detection based on recursive autoencoders (RAE): an unsupervised RAE trained with a novel unfolding objective learns feature vectors for phrases in syntactic trees, which are used to measure word- and phrase-wise similarity between two sentences.

Distributional Memory: A General Framework for Corpus-Based Semantics

The Distributional Memory approach is shown to be tenable despite the constraints imposed by its multi-purpose nature, and performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against several state-of-the-art methods.

Parsing with Compositional Vector Grammars

A Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations and improves performance on the types of ambiguities that require semantic information such as PP attachments.

Linguistic Regularities in Continuous Space Word Representations

The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset.
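The relation-specific offset finding is the familiar analogy arithmetic: vec("king") − vec("man") + vec("woman") should land nearest vec("queen"). A self-contained sketch with hypothetical toy vectors (real experiments use vectors learned from large corpora):

```python
import numpy as np

# Hypothetical 3-d word vectors chosen so the gender offset is consistent;
# learned embeddings exhibit this structure in much higher dimensions.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.2]),
}

def nearest(query, exclude):
    """Return the vocabulary word whose vector has the highest cosine
    similarity to `query`, skipping the words used to form the analogy."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], query))

analogy = vocab["king"] - vocab["man"] + vocab["woman"]
answer = nearest(analogy, exclude={"king", "man", "woman"})  # → "queen"
```

Excluding the query words is standard practice, since the input words themselves are often the nearest neighbours of the offset vector.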

I2T: Image Parsing to Text Description

An image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding and uses automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications.