Visually Grounded Meaning Representations

Carina Silberer, Vittorio Ferrari, Mirella Lapata. "Visually Grounded Meaning Representations." IEEE Transactions on Pattern Analysis and Machine Intelligence.
In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new model which uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and…
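The architecture the abstract describes can be pictured as one autoencoder per modality whose hidden codes are fused by a second-layer autoencoder. The following is a minimal, illustrative sketch of that idea only: the dimensions, the `Autoencoder` class, and the untrained random weights are all assumptions for exposition, not the paper's actual model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """One tied-weight autoencoder layer (illustrative, untrained)."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_in)

    def encode(self, x):
        # Map a visible vector to its hidden code.
        return sigmoid(x @ self.W + self.b_h)

    def decode(self, h):
        # Reconstruct the visible vector from the hidden code.
        return sigmoid(h @ self.W.T + self.b_v)

# Hypothetical dimensionalities for the two input modalities.
TEXT_DIM, VIS_DIM, HID, JOINT = 100, 50, 40, 30

text_ae = Autoencoder(TEXT_DIM, HID)    # first layer, textual modality
vis_ae = Autoencoder(VIS_DIM, HID)      # first layer, visual-attribute modality
joint_ae = Autoencoder(2 * HID, JOINT)  # second layer fuses both codes

def bimodal_embedding(text_vec, vis_vec):
    """Encode each modality, then jointly encode the stacked codes."""
    h_text = text_ae.encode(text_vec)
    h_vis = vis_ae.encode(vis_vec)
    return joint_ae.encode(np.concatenate([h_text, h_vis]))

# One hypothetical concept: a textual co-occurrence vector plus a
# visual attribute vector, fused into a single meaning representation.
text_vec = rng.random(TEXT_DIM)
vis_vec = rng.random(VIS_DIM)
z = bimodal_embedding(text_vec, vis_vec)
print(z.shape)  # (30,)
```

In the actual model each layer would be trained to minimize reconstruction error (the paper uses stacked autoencoders trained on textual and visual attribute vectors); the sketch above only shows the forward pass that produces the fused representation.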


Learning grounded word meaning representations on similarity graphs
Experimental results validate the ability of HM-SGE to simulate human similarity judgments and concept categorization, outperforming the state of the art.
A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images
A novel probabilistic model is proposed to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus; it attains results equally competitive with, or stronger than, other state-of-the-art multimodal models.
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge
This paper combines multimodal information from both text- and image-based representations derived from state-of-the-art distributional models to produce sparse, interpretable vectors using Joint Non-Negative Sparse Embedding, and demonstrates their ability to predict interpretable linguistic descriptions of human ground-truth semantic knowledge.
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search
Picturebook, a large-scale lookup operation that grounds language via ‘snapshots’ of the physical world accessed through image search, is introduced; gate activations corresponding to Picturebook embeddings are shown to be highly correlated with human judgments of concreteness.
Learning Visually Grounded and Multilingual Representations
A novel computational model of cross-situational word learning is proposed that takes images of natural scenes paired with their descriptions as input and incrementally learns probabilistic associations between words and image features.
Using Grounded Word Representations to Study Theories of Lexical Concepts
The fields of cognitive science and philosophy have proposed many different theories of how humans represent “concepts”. Multiple such theories are compatible with state-of-the-art NLP methods…
Mapping Distributional Semantics to Property Norms with Deep Neural Networks
A neural-network-based method is presented for automatically mapping a distributional semantic space onto a human-built property space; it is evaluated on word embeddings learned with different types of contexts and achieves state-of-the-art performance on the widely used McRae semantic feature production norms.
Social Image Tags as a Source of Word Embeddings: A Task-oriented Evaluation
Social image tags can be utilized as yet another source of visually enforced features, provided the amount of available tags is large enough; the generated embeddings can be effective in discriminating synonyms from antonyms, which has been an issue for approaches based on the distributional hypothesis.
Limitations of Cross-Lingual Learning from Image Search
This work investigates whether the meaning of other parts-of-speech (POS), in particular adjectives and verbs, can be learned in the same way as simple nouns.
A Neurobiologically Motivated Analysis of Distributional Semantic Models
A. Utsumi · Psychology, Computer Science · 2018
The analysis demonstrates that social and cognitive information is better encoded in text-based word vectors, but emotional information is not, and this result is discussed in terms of embodied theories for abstract concepts.


Models of Semantic Representation with Visual Attributes
This work creates a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images, and shows that these bimodal models give a better fit to human word association data compared to amodal models and word representations based on handcrafted norming data.
Multimodal Distributional Semantics
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Distributional Semantics in Technicolor
While visual models with state-of-the-art computer vision techniques perform worse than textual models in general tasks, they are as good or better models of the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words.
Grounded Compositional Semantics for Finding and Describing Images with Sentences
The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA, and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Combining Language and Vision with a Multimodal Skip-gram Model
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Perceptual Inference Through Global Lexical Similarity
A model that uses the global structure of memory to exploit the redundancy between language and perception in order to generate inferred perceptual representations for words with which the model has no perceptual experience is proposed.
Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean
This work presents a new means of extending the scope of multi-modal models to more commonly-occurring abstract lexical concepts via an approach that learns multimodal embeddings, and outperforms previous approaches in combining input from distinct modalities.
Topics in semantic representation.
This article analyzes the abstract computational problem underlying the extraction and use of gist, formulating this problem as a rational statistical inference that leads to a novel approach to semantic representation in which word meanings are represented in terms of a set of probabilistic topics.
Acquiring Human-like Feature-Based Conceptual Representations from Corpora
This work introduces a novel method that extracts candidate triples from parsed data and re-ranks them using semantic information and demonstrates the utility of external knowledge in guiding feature extraction, and suggests a number of avenues for future work.
Semi-supervised learning of compact document representations with deep networks
An algorithm is presented for learning text document representations based on semi-supervised autoencoders stacked to form a deep network; it can be trained efficiently on partially labeled corpora, producing very compact representations of documents while retaining as much class information and joint word statistics as possible.