Grounded Compositional Semantics for Finding and Describing Images with Sentences

@article{Socher2014GroundedCS,
  title={Grounded Compositional Semantics for Finding and Describing Images with Sentences},
  author={Richard Socher and Andrej Karpathy and Quoc V. Le and Christopher D. Manning and Andrew Y. Ng},
  journal={Transactions of the Association for Computational Linguistics},
  year={2014},
  volume={2},
  pages={207--218}
}
Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use…
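The dependency-tree composition described in the abstract can be sketched as follows. This is a hedged simplification, not the paper's exact model (the actual DT-RNN ties weights per dependency relation, normalizes by subtree size, and learns its parameters); all function and variable names here are invented for illustration:

```python
import numpy as np

def embed_node(i, children, word_vecs, Wv, Wc, b):
    """Compose a node vector bottom-up over a dependency tree.

    Simplified composition: h_i = tanh(Wv @ x_i + sum_j Wc @ h_j + b),
    where j ranges over the dependents of token i.
    """
    h = Wv @ word_vecs[i] + b
    for j in children.get(i, []):
        h = h + Wc @ embed_node(j, children, word_vecs, Wv, Wc, b)
    return np.tanh(h)

def rank_images(sentence_vec, image_vecs):
    """Rank images by inner product with the sentence vector."""
    return np.argsort(-(image_vecs @ sentence_vec))

# Tiny usage example with random parameters (d = 4 dimensions, 3 tokens).
rng = np.random.default_rng(0)
d, n_tokens, n_images = 4, 3, 5
word_vecs = rng.normal(size=(n_tokens, d))
Wv, Wc, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
children = {1: [0, 2]}   # token 1 is the root; tokens 0 and 2 depend on it
s = embed_node(1, children, word_vecs, Wv, Wc, b)
order = rank_images(s, rng.normal(size=(n_images, d)))
```

The sentence embedding lives in the same space as the image vectors, so retrieval reduces to a dot-product ranking.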
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…
Learning Meaningful Sentence Embedding Based on Recursive Auto-encoders
This work proposes a novel method to learn a meaning representation for variable-sized sentences based on recursive auto-encoders, in an unsupervised manner and without using any parse or dependency tree.
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data, along with a structured max-margin objective that allows the model to explicitly associate fragments across modalities.
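The max-margin objective mentioned above can be illustrated with a generic bidirectional ranking hinge loss. This is a sketch of the general idea, not the paper's exact fragment-level objective, and the names are invented:

```python
import numpy as np

def bidirectional_ranking_loss(S, margin=1.0):
    """Hinge loss over an image-sentence similarity matrix S.

    S[i, j] is the similarity of image i and sentence j; matched pairs
    lie on the diagonal. Each true pair is pushed above every mismatched
    pair by `margin`, in both retrieval directions.
    """
    diag = np.diag(S)
    cost_img = np.maximum(0.0, margin + S - diag[:, None])  # wrong sentences per image
    cost_sen = np.maximum(0.0, margin + S - diag[None, :])  # wrong images per sentence
    np.fill_diagonal(cost_img, 0.0)
    np.fill_diagonal(cost_sen, 0.0)
    return cost_img.sum() + cost_sen.sum()
```

With `S = [[2, 0], [0, 2]]` every true pair beats every false pair by at least the margin, so the loss is 0; with the pairs swapped, `S = [[0, 2], [2, 0]]`, every hinge term is active.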
Phrase-based Image Captioning
This paper presents a simple model that is able to generate descriptive sentences given a sample image, and proposes a simple language model that can produce relevant descriptions for a given test image using the inferred phrases.
Learning to Represent Image and Text with Denotation Graphs
It is shown that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations, which lead to stronger empirical results on the downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition.
Combining Language and Vision with a Multimodal Skip-gram Model
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Associating Images with Sentences Using Recurrent Canonical Correlation Analysis
This model includes a contextual attention-based LSTM-RNN that can selectively attend to salient regions of an image at each time step and represent all the salient content within a few steps; the work focuses on modelling a contextual visual attention mechanism for the task of association analysis.
A Convolutional Neural Network for Modelling Sentences
A convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) is described that is adopted for the semantic modelling of sentences and induces a feature graph over the sentence that is capable of explicitly capturing short- and long-range relations.
Deep visual-semantic alignments for generating image descriptions
A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
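The region-to-word alignment in the entry above is often summarized by a sum-of-max score, where each word picks its best-matching image region. The sketch below is a simplified form of that idea, with assumed shapes and invented names:

```python
import numpy as np

def sum_of_max_alignment(region_vecs, word_vecs):
    """Score an image-sentence pair.

    region_vecs: (n_regions, d) embeddings of detected image regions
    word_vecs:   (n_words, d) embeddings of sentence tokens
    Each word contributes the dot product with its best-matching region.
    """
    sims = word_vecs @ region_vecs.T   # (n_words, n_regions)
    return float(sims.max(axis=1).sum())
```

For `region_vecs = eye(2)` and `word_vecs = [[1, 0], [0, 1]]`, each word aligns perfectly with one region, so the score is 2.0.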
Expressing an Image Stream with a Sequence of Natural Sentences
An approach for retrieving a sequence of natural sentences for an image stream is presented; it learns directly from a vast user-generated resource of blog posts as text-image parallel training data and outperforms other state-of-the-art candidate methods.

References

Showing 1-10 of 57 references
Semantic Compositionality through Recursive Matrix-Vector Spaces
A recursive neural network model is introduced that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length, and can learn the meaning of operators in propositional logic and natural language.
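The matrix-vector composition referenced above can be sketched in a simplified form: each word carries both a vector and a matrix, each matrix transforms the partner's vector, and a nonlinearity merges the results. The parameterization here is illustrative, not the paper's exact formulation:

```python
import numpy as np

def mv_compose(a_vec, A_mat, b_vec, B_mat, W):
    """Matrix-vector composition: p = tanh(W @ [B @ a; A @ b]).

    a_vec, b_vec: (d,) vectors of the two constituents
    A_mat, B_mat: (d, d) matrices of the two constituents
    W:            (d, 2d) projection back to d dimensions
    """
    stacked = np.concatenate([B_mat @ a_vec, A_mat @ b_vec])
    return np.tanh(W @ stacked)

# Example: compose two 3-dimensional words with identity matrices and a
# zero projection (the output then collapses to tanh(0) = 0).
d = 3
a, b = np.ones(d), np.arange(d, dtype=float)
p = mv_compose(a, np.eye(d), b, np.eye(d), np.zeros((d, 2 * d)))
```

Because the output has the same dimensionality `d` as its inputs, the operation can be applied recursively up a parse tree.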
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
A max-margin structure prediction architecture based on recursive neural networks is introduced that can successfully recover such structure both in complex scene images and in sentences.
Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks
Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of…
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Every Picture Tells a Story: Generating Sentences from Images
A system is presented that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image or to obtain images that illustrate a given sentence.
A unified architecture for natural language processing: deep neural networks with multitask learning
We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic…
Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
This work introduces a method for paraphrase detection based on recursive autoencoders (RAEs) trained in an unsupervised manner with a novel unfolding objective; it learns feature vectors for phrases in syntactic trees to measure word- and phrase-wise similarity between two sentences.
Distributional Memory: A General Framework for Corpus-Based Semantics
The Distributional Memory approach is shown to be tenable despite the constraints imposed by its multi-purpose nature, and performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against several state-of-the-art methods.
Automatic Caption Generation for News Images
Abstract: This thesis is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Our model learns to create captions from…
Parsing with Compositional Vector Grammars
A Compositional Vector Grammar (CVG) is introduced, which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations and improves performance on the types of ambiguities that require semantic information, such as PP attachments.