Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception

@inproceedings{Kiela2015MultiAC,
  title={Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception},
  author={Douwe Kiela and Stephen Clark},
  booktitle={EMNLP},
  year={2015}
}
Multi-modal semantics has relied on feature norms or raw image data for perceptual input. […] To our knowledge, this is the first work to combine linguistic and auditory information into multi-modal representations.
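As a rough illustration of how such a combination can work, the sketch below concatenates an L2-normalised linguistic vector with an L2-normalised auditory vector (for example, a bag-of-audio-words histogram) under a mixing weight. This is not the authors' implementation; the function names, dimensionalities, mixing weight, and toy data are illustrative assumptions.

# Minimal sketch of concatenation-based fusion of a linguistic vector with an
# auditory vector (e.g. a bag-of-audio-words histogram). Illustrative only:
# alpha, the dimensionalities and the toy data are assumptions, not the paper's setup.
import numpy as np

def fuse(linguistic_vec, auditory_vec, alpha=0.5):
    """Concatenate L2-normalised unimodal vectors, mixed by weight alpha."""
    lin = linguistic_vec / (np.linalg.norm(linguistic_vec) + 1e-12)
    aud = auditory_vec / (np.linalg.norm(auditory_vec) + 1e-12)
    return np.concatenate([alpha * lin, (1.0 - alpha) * aud])

def cosine(u, v):
    """Cosine similarity between two fused (or unimodal) vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy example: 300-d linguistic vectors, 100-d auditory vectors.
rng = np.random.default_rng(0)
dog_text, cat_text = rng.normal(size=300), rng.normal(size=300)
dog_audio, cat_audio = rng.random(100), rng.random(100)
print(cosine(fuse(dog_text, dog_audio), fuse(cat_text, cat_audio)))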

Citations

Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
TLDR
This paper examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, and shows how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments.
Sensory-Aware Multimodal Fusion for Word Semantic Similarity Estimation
TLDR
This work estimates multimodal word representations by fusing the auditory and visual modalities with the text modality, using middle and late fusion with a modality weight assigned to each unimodal representation.
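The two fusion strategies mentioned here can be contrasted with a short sketch: middle fusion weights and concatenates the unimodal vectors before a single similarity computation, while late fusion computes one similarity per modality and then takes a weighted average. The weights, dimensionalities, and toy data below are placeholders, not values from the paper.

# Sketch contrasting middle vs. late fusion with per-modality weights.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def middle_fusion_sim(reps_a, reps_b, weights):
    """Weight and concatenate normalised unimodal vectors, then compare once."""
    fused_a = np.concatenate([w * v / np.linalg.norm(v) for w, v in zip(weights, reps_a)])
    fused_b = np.concatenate([w * v / np.linalg.norm(v) for w, v in zip(weights, reps_b)])
    return cos(fused_a, fused_b)

def late_fusion_sim(reps_a, reps_b, weights):
    """Compute one similarity per modality, then take a weighted average."""
    sims = [cos(a, b) for a, b in zip(reps_a, reps_b)]
    return float(np.dot(weights, sims) / np.sum(weights))

rng = np.random.default_rng(1)
word1 = [rng.normal(size=d) for d in (300, 128, 64)]  # text, visual, auditory
word2 = [rng.normal(size=d) for d in (300, 128, 64)]
weights = [0.6, 0.25, 0.15]                           # illustrative modality weights
print(middle_fusion_sim(word1, word2, weights), late_fusion_sim(word1, word2, weights))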
Optimizing Visual Representations in Semantic Multi-modal Models with Dimensionality Reduction, Denoising and Contextual Information
This paper improves visual representations for multi-modal semantic models, by (i) applying standard dimensionality reduction and denoising techniques, and by (ii) proposing a novel technique […]
Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
TLDR
A visually grounded language model capable of embedding concepts in a visually grounded semantic space that enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.
Crossmodal Network-Based Distributional Semantic Models
TLDR
This work proposes the crossmodal extension of a two-tier text-based model, where semantic representations are encoded in the first layer, while the second layer is used for computing similarity between words.
Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy"
TLDR
This paper builds perceptual models that use haptic, auditory, and proprioceptive data acquired through robot exploratory behaviors to go beyond vision, grounding natural language words describing objects using supervision from an interactive human-robot "I Spy" game.
Associative Multichannel Autoencoder for Multimodal Word Representation
TLDR
A novel associative multichannel autoencoder (AMA) that first learns the associations between textual and perceptual modalities so as to predict missing perceptual information for concepts; the modalities are then fused by reconstructing their original and associated embeddings.
Investigating Inner Properties of Multimodal Representation and Semantic Compositionality with Brain-based Componential Semantics
TLDR
Simple interpretation methods based on brain-based componential semantics shed light on fundamental questions of natural language understanding, such as how to represent the meaning of words and how to combine word meanings into larger units.
Multimodal Visual and Simulated Muscle Activations for Grounded Semantics of Hand-related Descriptions
TLDR
The Words-as-Classifiers model of grounded semantics is applied to learn a mapping between features from the two modalities and corresponding hand image descriptions, showing that a multimodal fusion of visual and muscle features yields better results than either modality alone on image and description retrieval tasks.
Guiding Interaction Behaviors for Multi-modal Grounded Language Learning
TLDR
This work gathers behavior annotations from humans and demonstrates that these improve language grounding performance by allowing a system to focus on relevant behaviors for words like “white” or “half-full” that can be understood by looking or lifting, respectively.

References

Showing 1–10 of 40 references
Grounding Semantics in Olfactory Perception
TLDR
This is the first work to evaluate semantic similarity on representations grounded in olfactory data, constructing a novel bag-of-chemical-compounds model and using standard evaluations for multi-modal semantics.
Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More
TLDR
An unsupervised method to determine whether to include perceptual input for a concept is proposed, and it is shown that it significantly improves the ability of multi-modal models to learn and represent word meanings.
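The dispersion-based filter described here can be sketched as follows: estimate how varied a concept's images are via the mean pairwise cosine distance among its image vectors, and add perceptual input only when that dispersion is low (i.e., the concept is likely concrete). The threshold and toy vectors below are placeholder assumptions, not the values used in the reference.

# Sketch of an image-dispersion filter: average pairwise cosine distance over a
# concept's image vectors decides whether perceptual input is included.
import numpy as np
from itertools import combinations

def image_dispersion(image_vecs):
    """Mean pairwise cosine distance over a concept's image vectors."""
    dists = [1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
             for u, v in combinations(image_vecs, 2)]
    return float(np.mean(dists))

def build_representation(text_vec, image_vecs, threshold=0.6):
    """Concatenate the visual centroid only for low-dispersion (concrete) concepts."""
    if image_dispersion(image_vecs) < threshold:
        return np.concatenate([text_vec, np.mean(image_vecs, axis=0)])
    return text_vec  # high dispersion: likely abstract, keep the text vector only

rng = np.random.default_rng(3)
images = [rng.normal(size=128) for _ in range(10)]
print(build_representation(rng.normal(size=300), images).shape)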
Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
TLDR
This work presents a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word.
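A cross-modal mapping of this kind can be sketched as a linear regression from image features into word-vector space, followed by nearest-neighbour retrieval of the label. The snippet below uses ridge regression and synthetic data as an assumed stand-in for the paper's actual setup; feature sizes and names are illustrative.

# Sketch of zero-shot cross-modal mapping: fit a linear (ridge) map from image
# features to word vectors on seen concepts, then project an unseen image and
# retrieve the nearest word. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 4096))  # image features for seen concepts
Y_train = rng.normal(size=(500, 300))   # word vectors for their labels

mapping = Ridge(alpha=1.0).fit(X_train, Y_train)

def label_image(image_feat, vocab_vecs, vocab_words):
    """Project an image into word-vector space and return the nearest word."""
    projected = mapping.predict(image_feat[None, :])[0]
    sims = vocab_vecs @ projected / (
        np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(projected) + 1e-12)
    return vocab_words[int(np.argmax(sims))]

vocab_vecs = rng.normal(size=(1000, 300))
vocab_words = [f"word_{i}" for i in range(1000)]
print(label_image(rng.normal(size=4096), vocab_vecs, vocab_words))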
Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a convolutional neural network.
Multimodal Distributional Semantics
TLDR
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Combining Language and Vision with a Multimodal Skip-gram Model
TLDR
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean
TLDR
This work extends the scope of multi-modal models to more commonly occurring abstract lexical concepts via an approach that learns multimodal embeddings, outperforming previous approaches to combining input from distinct modalities.
A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
TLDR
This work improves a two-dimensional multimodal version of Latent Dirichlet Allocation, presents a novel way to integrate visual features into the LDA model using unsupervised clusters of images, and provides two novel ways to extend the bimodal model to support three or more modalities.
Learning Words from Images and Speech
TLDR
This work explores the possibility of learning both an acoustic model and a word/image association from multi-modal co-occurrences between speech and pictures alone (a task known as cross-situational learning), inspired by the observation that infants spontaneously achieve this kind of correspondence during their first year of life.
Grounded Models of Semantic Representation
TLDR
Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.