Corpus ID: 7335121

Learning Multi-Modal Word Representation Grounded in Visual Context

Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari
Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics from the textual context of words in large corpora. More recently, researchers have attempted to integrate perceptual and visual features. Most of these works consider the visual appearance of objects to enhance word representations, but they ignore the visual environment and context in which objects appear. We propose to unify text-based techniques with vision… 


Accurate Word Representations with Universal Visual Guidance
A visual representation method is proposed that explicitly enhances conventional word embeddings with multiple-aspect senses from visual guidance; the method is shown to substantially improve disambiguation accuracy.
Incorporating Visual Semantics into Sentence Representations within a Grounded Space
A model is proposed that transfers visual information to textual representations by learning an intermediate representation space, the grounded space; it is shown to outperform the previous state of the art on classification and semantic relatedness tasks.
Probing Contextualized Sentence Representations with Visual Awareness
We present a universal framework to model contextualized sentence representations with visual awareness, motivated by the shortcomings of multimodal parallel data with manual…
MCSE: Multimodal Contrastive Learning of Sentence Embeddings
This work proposes a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective and shows that this model excels in aligning semantically similar sentences, providing an explanation for its improved performance.
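A multimodal contrastive objective of this kind can be sketched as a symmetric InfoNCE-style loss between paired sentence and image embeddings. This is a minimal numpy illustration of the general technique, not the MCSE implementation; all names, dimensions, and the temperature value are invented for the sketch:

```python
import numpy as np

def info_nce(text_emb, img_emb, temperature=0.05):
    """Symmetric contrastive loss between paired sentence and image
    embeddings (row i of each matrix is assumed to be a matched pair)."""
    # Cosine similarity matrix between every text/image combination.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature
    n = logits.shape[0]
    # Cross-entropy with the diagonal (matched pair) as the target,
    # averaged over both directions (text->image and image->text).
    log_p_ti = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_it = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(n)
    return -(log_p_ti[diag, diag].mean() + log_p_it[diag, diag].mean()) / 2.0

rng = np.random.default_rng(0)
pairs = rng.normal(size=(8, 16))
# Matched pairs share structure, so their loss should be lower than
# that of embeddings paired with unrelated random vectors.
aligned = info_nce(pairs, pairs + 0.01 * rng.normal(size=(8, 16)))
shuffled = info_nce(pairs, rng.normal(size=(8, 16)))
print(aligned < shuffled)
```

Minimizing such a loss pulls each sentence embedding toward its paired image embedding while pushing it away from the other images in the batch, which is what encourages semantically similar sentences to align.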
Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation
A generalized advantage of multimodal representations over language-only ones is observed on concrete word pairs, but not on abstract ones, confirming the effectiveness of these models at aligning language and vision, which yields better semantic representations for concepts grounded in images.
Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training
This paper argues that, since concrete and abstract words are processed differently in the brain, such approaches sacrifice the abstract knowledge obtained from textual statistics while acquiring perceptual information, and that implicit grounding of the word embeddings is needed instead.
Integrate Image Representation to Text Model on Sentence Level: a Semi-supervised Framework
A novel semi-supervised visual integration framework for sentence-level language representation that requires only an image database, with no extra alignment needed for training or prediction, providing an efficient and feasible method for multimodal language learning.
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search
Picturebook, a large-scale lookup operation that grounds language via ‘snapshots’ of the physical world accessed through image search, is introduced, and gate activations corresponding to Picturebook embeddings are shown to be highly correlated with human judgments of concreteness ratings.
Image Captioning with Visual Object Representations Grounded in the Textual Modality
An improvement in structural correlation between the word embeddings and both original and projected object vectors suggests that the grounding is actually mutual.
Semi-supervised Visual Feature Integration for Language Models through Sentence Visualization
Since the proposed semi-supervised visual integration framework requires only an image database, and no images aligned with the processed texts, it provides a feasible way to do multimodal language learning.


Combining Language and Vision with a Multimodal Skip-gram Model
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a convolutional neural network.
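The concatenation scheme described above can be sketched in a few lines. This is an illustrative numpy sketch of the general fusion technique, not the paper's pipeline; the vector names and dimensions are assumptions:

```python
import numpy as np

# Hypothetical stand-ins for a skip-gram word vector and a pooled
# CNN feature vector for the same concept (dimensions are invented).
rng = np.random.default_rng(42)
linguistic = rng.normal(size=300)  # e.g. a 300-d skip-gram vector
visual = rng.normal(size=128)      # e.g. a 128-d CNN feature vector

def fuse(text_vec, image_vec):
    """Concatenate L2-normalized modality vectors so that neither
    modality dominates the result purely through its scale."""
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([t, v])

multimodal = fuse(linguistic, visual)
print(multimodal.shape)  # (428,)
```

Normalizing each modality before concatenation is a common design choice: without it, cosine similarities on the fused vectors would be dominated by whichever modality happens to have larger magnitudes.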
Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean
This work presents a new means of extending the scope of multi-modal models to more commonly-occurring abstract lexical concepts via an approach that learns multimodal embeddings, and outperforms previous approaches in combining input from distinct modalities.
Multimodal Distributional Semantics
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
VisualWord2Vec (Vis-W2V): Learning Visually Grounded Word Embeddings Using Abstract Scenes
A model to learn visually grounded word embeddings (vis-w2v) that capture visual notions of semantic relatedness is proposed; it shows improvements over text-only word embeddings (word2vec) on three tasks: common-sense assertion classification, visual paraphrasing, and text-based image retrieval.
Deep Embedding for Spatial Role Labeling
This paper introduces the visually informed embedding of a word (VIEW), a continuous vector representation for a word, extracted from a deep neural model trained on the Microsoft COCO data set…
Imagined Visual Representations as Multimodal Embeddings
This paper presents a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations, providing a cognitively plausible way of building representations, consistent with the inherently re-constructive and associative nature of human memory.
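A language-to-vision mapping of this kind can be illustrated with a simple regression from text embeddings to visual vectors, whose predictions ("imagined" visual vectors) are then concatenated with the text embeddings. This is a toy numpy sketch using closed-form ridge regression as an assumed stand-in for the learned mapping; all data and dimensions are synthetic:

```python
import numpy as np

# Synthetic setup: a hidden linear relation links text space (50-d)
# to visual space (20-d); only some "words" come with images.
rng = np.random.default_rng(1)
true_map = rng.normal(size=(50, 20))
text = rng.normal(size=(200, 50))
vision = text @ true_map + 0.01 * rng.normal(size=(200, 20))

train_t, train_v = text[:150], vision[:150]    # words seen with images
unseen_t, unseen_v = text[150:], vision[150:]  # words without images

# Closed-form ridge regression: W = (X^T X + lam I)^-1 X^T Y
lam = 1e-3
W = np.linalg.solve(train_t.T @ train_t + lam * np.eye(50),
                    train_t.T @ train_v)

imagined = unseen_t @ W                        # predicted visual vectors
multimodal = np.concatenate([unseen_t, imagined], axis=1)

rel_err = np.linalg.norm(imagined - unseen_v) / np.linalg.norm(unseen_v)
print(multimodal.shape, rel_err < 0.05)
```

The point of the technique is the last two lines: every word gets a visual component, including words that never appeared with an image, because the mapping generalizes from the words that did.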
Distributional Semantics in Technicolor
While visual models with state-of-the-art computer vision techniques perform worse than textual models in general tasks, they are as good or better models of the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words.
Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More
An unsupervised method to determine whether to include perceptual input for a concept is proposed, and it is shown that it significantly improves the ability of multi-modal models to learn and represent word meanings.
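The filtering idea behind image dispersion can be illustrated as the average pairwise cosine distance among the image vectors collected for a concept: visually homogeneous (typically concrete) concepts score low, visually scattered (typically abstract) ones score high. This is a toy numpy sketch with invented data, not the paper's pipeline:

```python
import numpy as np

def image_dispersion(image_vecs):
    """Average pairwise cosine distance among one concept's image
    vectors; a high value suggests the images are visually diverse,
    so perceptual input may be better left out for that concept."""
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)  # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))

rng = np.random.default_rng(3)
center = rng.normal(size=64)
concrete = center + 0.1 * rng.normal(size=(10, 64))  # tight cluster
abstract = rng.normal(size=(10, 64))                 # scattered
print(image_dispersion(concrete) < image_dispersion(abstract))
```

In the filtering setup this score would be thresholded per concept: visual features are mixed in only when dispersion is low, which is the "less is sometimes more" decision in the title.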
Learning Grounded Meaning Representations with Autoencoders
A new model is introduced which uses stacked autoencoders to learn higher-level embeddings from textual and visual input and which outperforms baselines and related models on similarity judgments and concept categorization.