Corpus ID: 63426819

Deep embodiment: grounding semantics in perceptual modalities

@phdthesis{Kiela2017DeepEG,
  title={Deep embodiment: grounding semantics in perceptual modalities},
  author={Douwe Kiela},
  school={University of Cambridge},
  year={2017}
}
Multi-modal distributional semantic models address the fact that text-based semantic models, which represent word meanings as a distribution over other words, suffer from the grounding problem. This thesis advances the field of multi-modal semantics in two directions. First, it shows that transferred convolutional neural network representations outperform the traditional bag-of-visual-words method for obtaining visual features. It is then shown that these representations may be applied…
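To make the first contribution concrete, here is a minimal sketch of what "transferred" convolutional network representations look like in practice: a network pre-trained on ImageNet is used as a fixed feature extractor, and its penultimate-layer activations stand in for a bag-of-visual-words histogram. The backbone (ResNet-18), library (PyTorch/torchvision) and normalisation choices are illustrative assumptions, not the exact setup of the thesis.

import torch
from PIL import Image
from torchvision import models, transforms

# Pre-trained CNN used as a fixed feature extractor (illustrative backbone).
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # drop the classifier head; keep the 512-d embedding
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_feature(image_path):
    """Return an L2-normalised CNN feature for one image depicting a concept."""
    with torch.no_grad():
        x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        v = cnn(x).squeeze(0)
    return v / v.norm()

Features for several images of the same word can then be aggregated (for example, mean-pooled) and combined with a textual distributional vector; the citing work listed below builds on variations of this pattern.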
Sensory-Aware Multimodal Fusion for Word Semantic Similarity Estimation
TLDR: This work estimates multimodal word representations by fusing the auditory and visual modalities with the text modality, using middle and late fusion of representations with a modality weight assigned to each unimodal representation.
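As a rough illustration of the weighted fusion just described (a sketch under simplifying assumptions: the function names, normalisation and example weights are illustrative, not taken from the paper), middle fusion concatenates weighted unimodal vectors, while late fusion combines per-modality similarity scores.

import numpy as np

def _unit(v):
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

def middle_fuse(modality_vectors, weights):
    """Weighted concatenation of unit-normalised unimodal vectors for one word."""
    return np.concatenate([w * _unit(v) for v, w in zip(modality_vectors, weights)])

def late_fuse_similarity(vectors_a, vectors_b, weights):
    """Weighted sum of per-modality cosine similarities between two words."""
    sims = [float(_unit(a) @ _unit(b)) for a, b in zip(vectors_a, vectors_b)]
    return sum(w * s for w, s in zip(weights, sims))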
A Computational Study on Word Meanings and Their Distributed Representations via Polymodal Embedding
TLDR: This paper examines the relationship between a true word meaning and its distributed representation via a polymodal embedding approach, inspired by the theory that humans tend to use diverse sources in developing a word meaning.
Learning Visually Grounded Sentence Representations
TLDR: This work investigates grounded sentence representations, where a sentence encoder is trained to predict the image features of a given caption and the resulting features are used as sentence representations.
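A minimal sketch of this grounding idea, assuming PyTorch and a generic GRU encoder (the paper's actual architecture and loss may differ): the encoder is trained so that its sentence vector regresses onto the CNN feature of the caption's paired image, and the trained encoder then provides grounded sentence representations.

import torch
import torch.nn as nn

class GroundedSentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.to_image = nn.Linear(emb_dim, img_dim)  # predicts image features

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, emb_dim)
        return self.to_image(h.squeeze(0))      # (batch, img_dim)

def grounding_loss(model, token_ids, image_feats):
    # Regression of the predicted sentence vector onto the paired image feature.
    return nn.functional.mse_loss(model(token_ids), image_feats)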
Countering Language Drift via Visual Grounding
TLDR: It is shown that a combination of syntactic (language model likelihood) and semantic (visual grounding) constraints gives the best communication performance, allowing pre-trained agents to retain English syntax while learning to accurately convey the intended meaning.
Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods
TLDR: This survey provides a systematic overview of research progress on the faithfulness problem in NLG, covering problem analysis, evaluation metrics and optimization methods, and organizes the evaluation and optimization methods for different tasks into a unified taxonomy to facilitate comparison and learning across tasks.
Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction
TLDR: Channel-Exchanging-Network (CEN) is proposed, which is self-adaptive, parameter-free and, more importantly, applicable to both multimodal fusion and multitask learning; it dynamically exchanges channels between sub-networks of different modalities.
Talk the Walk: Navigating New York City through Grounded Dialogue
TLDR: This work focuses on the task of tourist localization and develops the novel Masked Attention for Spatial Convolutions (MASC) mechanism for grounding tourist utterances in the guide's map, showing that it yields significant improvements for both emergent and natural language communication.
Deep Multimodal Fusion by Channel Exchanging
TLDR: Channel-Exchanging-Network is proposed, a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities; the exchange is self-guided by individual channel importance, measured by the magnitude of the Batch-Normalization (BN) scaling factor during training.
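The channel-exchanging step can be sketched as follows (a deliberately simplified, hypothetical version: the threshold, tensor shapes and the surrounding shared-BN machinery are assumptions, not the paper's full method): channels whose BN scaling factor has small magnitude are treated as uninformative and replaced by the corresponding channels from the other modality.

import torch

def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=1e-2):
    """feat_*: (batch, C, H, W) post-BN features; gamma_*: (C,) BN scaling factors."""
    out_a, out_b = feat_a.clone(), feat_b.clone()
    swap_a = gamma_a.abs() < threshold    # channels of modality A deemed uninformative
    swap_b = gamma_b.abs() < threshold
    out_a[:, swap_a] = feat_b[:, swap_a]  # fill them with the other modality's channels
    out_b[:, swap_b] = feat_a[:, swap_b]
    return out_a, out_b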
Artificial Moral Agents Within an Ethos of AI4SG
As artificial intelligence (AI) continues to proliferate into every area of modern life, there is no doubt that society has to think deeply about the potential impact, whether negative or positive…
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
TLDR: This paper proposes an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora, and builds a unified Transformer model to jointly learn visual representations, textual representations and the semantic alignment between images and texts.
…

References

Showing 1-10 of 262 references
Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean
TLDR: This work presents a new means of extending the scope of multi-modal models to more commonly occurring abstract lexical concepts via an approach that learns multimodal embeddings, and outperforms previous approaches to combining input from distinct modalities.
Combining Language and Vision with a Multimodal Skip-gram Model
TLDR: Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Systematically Grounding Language through Vision in a Deep, Recurrent Neural Network
TLDR: A deep, recurrent neural network is taught to ground language in a micro-world and exhibits strong systematicity, recovering appropriate meanings even for novel objects and descriptions, fulfilling an important prerequisite of general machine intelligence.
Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
TLDR: This work presents a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word.
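A bare-bones version of such cross-modal mapping (illustrative only; the ridge regulariser and nearest-neighbour decision rule are assumptions rather than the paper's exact method): learn a linear map from image features to word-vector space on seen concepts, then label an unseen image with the closest candidate word vector.

import numpy as np

def fit_cross_modal_map(V_seen, W_seen, lam=1.0):
    """V_seen: (n, d_img) image features; W_seen: (n, d_txt) word vectors."""
    d = V_seen.shape[1]
    # Ridge-regression solution mapping image space into word space.
    return np.linalg.solve(V_seen.T @ V_seen + lam * np.eye(d), V_seen.T @ W_seen)

def zero_shot_label(image_feat, A, candidate_words, candidate_vecs):
    proj = image_feat @ A                      # map the image into word space
    sims = candidate_vecs @ proj / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(proj) + 1e-12)
    return candidate_words[int(np.argmax(sims))]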
Multimodal Distributional Semantics
TLDR: This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach and provides somewhat complementary semantic information with respect to the latter.
Redundancy in Perceptual and Linguistic Experience: Comparing Feature-Based and Distributional Models of Semantic Representation
TLDR: It is argued that the amount of perceptual and other semantic information that can be learned from purely distributional statistics has been underappreciated, and that future focus should be on understanding the cognitive mechanisms humans use to integrate the two sources.
Grounded Models of Semantic Representation
TLDR: Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
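One standard way to uncover latent information shared among modalities is canonical correlation analysis; the sketch below uses scikit-learn's CCA as an illustrative stand-in for the paper's own latent model, projecting both views into a shared space and averaging them instead of concatenating.

import numpy as np
from sklearn.cross_decomposition import CCA

def shared_space_embeddings(text_mat, percept_mat, k=50):
    """text_mat: (n_words, d_text); percept_mat: (n_words, d_percept)."""
    cca = CCA(n_components=k)
    text_c, percept_c = cca.fit_transform(text_mat, percept_mat)
    return (text_c + percept_c) / 2.0  # one fused k-dimensional vector per word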
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR: This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that this semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
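A compact sketch of a DeViSE-style objective (simplified: the projection layer, negative sampling scheme and margin value here are assumptions): image features are projected into word-vector space and scored with a hinge rank loss that favours the correct label's word vector over sampled incorrect ones.

import torch

def hinge_rank_loss(img_proj, label_vecs, neg_vecs, margin=0.1):
    """img_proj: (B, d) projected image features; label_vecs: (B, d) correct labels;
    neg_vecs: (B, K, d) word vectors of sampled incorrect labels."""
    pos = (img_proj * label_vecs).sum(dim=1, keepdim=True)  # (B, 1)
    neg = torch.einsum("bd,bkd->bk", img_proj, neg_vecs)    # (B, K)
    return torch.clamp(margin - pos + neg, min=0.0).sum(dim=1).mean()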
A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
TLDR: This work improves a two-dimensional multimodal version of Latent Dirichlet Allocation, presents a novel way to integrate visual features into the LDA model using unsupervised clusters of images, and provides two ways to extend the bimodal models to support three or more modalities.
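One plausible reading of the "unsupervised clusters of images" step, sketched with scikit-learn (an assumption, not the paper's exact pipeline): k-means clusters of image features define a discrete visual vocabulary, and each concept's pseudo-document mixes its text tokens with the visual-word tokens of its images before being handed to a standard LDA implementation.

from sklearn.cluster import KMeans

def visual_word_tokens(image_feats, n_clusters=100, seed=0):
    """image_feats: (n_images, d) array. Returns one discrete visual-word token per image."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(image_feats)
    return [f"VIS_{c}" for c in km.labels_]

def concept_document(text_tokens, image_ids, vis_tokens):
    """Mix a concept's text tokens with the visual-word tokens of its images."""
    return list(text_tokens) + [vis_tokens[i] for i in image_ids]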
Grounded Compositional Semantics for Finding and Describing Images with Sentences
TLDR: The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
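The dependency-tree composition can be caricatured as a recursive function (a simplified sketch; the real DT-RNN conditions its weights on dependency relations and child counts, which is omitted here): each node's vector combines its word embedding with a transformed sum of its children's vectors, and the root vector is then mapped into a joint space shared with image features.

import numpy as np

def compose(node, embeddings, W_word, W_child, b):
    """node: {"word": str, "children": [subnodes]}; returns the phrase/sentence vector."""
    child_sum = np.zeros(b.shape[0])
    for child in node["children"]:
        child_sum += compose(child, embeddings, W_word, W_child, b)
    return np.tanh(W_word @ embeddings[node["word"]] + W_child @ child_sum + b)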