Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception

@article{Kiela2017LearningNA,
  title={Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception},
  author={Douwe Kiela and Stephen Clark},
  journal={J. Artif. Intell. Res.},
  year={2017},
  volume={60},
  pages={1003-1030}
}
Multi-modal semantics, which aims to ground semantic representations in perception, has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics. After having shown the quality of such auditorily grounded representations, we show how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments… 
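The grounding recipe evaluated in work like this can be illustrated with a small sketch: concatenate an L2-normalised linguistic vector with a mean-pooled audio feature vector and compare concepts by cosine similarity. This is only an illustrative sketch with random placeholder vectors, not the paper's actual pipeline, which builds its audio representations from real sound data.

```python
# Minimal sketch (not the paper's exact pipeline): fuse a linguistic word
# vector with a mean-pooled audio feature vector by L2-normalised
# concatenation, then score word-pair similarity with cosine similarity.
# The embeddings below are random placeholders standing in for real
# skip-gram vectors and aggregated audio descriptors.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    """Scale a vector to unit length (no-op for the zero vector)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, audio_vecs):
    """Concatenate a text embedding with the mean of audio clip features."""
    audio_vec = np.mean(audio_vecs, axis=0)          # pool clips for a concept
    return np.concatenate([l2_normalize(text_vec), l2_normalize(audio_vec)])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder data: 300-d text vectors and a bag of 128-d audio descriptors.
text = {w: rng.normal(size=300) for w in ("guitar", "violin", "car")}
audio = {w: rng.normal(size=(5, 128)) for w in text}

multimodal = {w: fuse(text[w], audio[w]) for w in text}
print(cosine(multimodal["guitar"], multimodal["violin"]))
```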
Learning Multimodal Word Representations by Explicitly Embedding Syntactic and Phonetic Information
TLDR
This article proposes an effective multimodal word representation model that uses two gate mechanisms to explicitly embed syntactic and phonetic information into multimodal representations, and trains the model with supervised learning.
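As an illustration of the gating idea described in the TLDR above, the sketch below mixes a textual and a phonetic vector through a per-dimension sigmoid gate. The weights are random and untrained; this is a hedged sketch of the mechanism, not the article's model.

```python
# Illustrative gating fusion (not the article's trained model): a sigmoid
# gate decides, per dimension, how much of the phonetic signal to mix
# into the textual embedding. Weights here are random, untrained values.
import numpy as np

rng = np.random.default_rng(1)
dim = 100

W_gate = rng.normal(scale=0.1, size=(dim, 2 * dim))   # untrained gate weights
b_gate = np.zeros(dim)

def gated_fusion(text_vec, phon_vec):
    pre_act = W_gate @ np.concatenate([text_vec, phon_vec]) + b_gate
    gate = 1.0 / (1.0 + np.exp(-pre_act))              # sigmoid gate in [0, 1]
    return gate * text_vec + (1.0 - gate) * phon_vec   # per-dimension mixing

fused = gated_fusion(rng.normal(size=dim), rng.normal(size=dim))
print(fused.shape)  # (100,)
```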
A Synchronized Word Representation Method With Dual Perceptual Information
TLDR
A language model is proposed that trains on dual perceptual information in a synchronized way, adopting an attention model to exploit both textual and phonetic perceptual information in unsupervised learning tasks.
Improving Automated Segmentation of Radio Shows with Audio Embeddings
TLDR
It is found that a set-up including audio embeddings generated through a non-speech sound event classification task significantly outperforms the authors' text-only baseline by 32.3% in F1-measure, and that different classification tasks yield audio embeddings that vary in segmentation performance.
An Integrated Neural Decoder of Linguistic and Experiential Meaning
TLDR
Initial evidence is presented that modeling nonlinguistic “experiential” knowledge contributes to decoding neural representations of sentence meaning, via a model-based approach showing that experiential and linguistically acquired knowledge can be detected in brain activity elicited by reading natural sentences.
Audiovisual, Genre, Neural and Topical Textual Embeddings for TV Programme Content Representation
TLDR
This work develops vectorial representations for low-level multimodal features of BBC TV programmes, and uses a standard recommender and pairwise similarity matrices of content vectors to estimate viewers' behaviours.
Uni- and Multimodal and Structured Representations for Modeling Frame Semantics
TLDR
The overall challenge of understanding meaning in language by capturing world knowledge is examined along two branches: knowledge about situations and actions as expressed in texts, and structured relational knowledge as stored in knowledge bases, both addressed via the lexical-semantic knowledge base FrameNet.
Functional Distributional Semantics
TLDR
This work proposes a novel probabilistic framework which draws on both formal semantics and recent advances in machine learning, and describes an implementation of this framework using a combination of Restricted Boltzmann Machines and feedforward neural networks.
Enhanced Double-Carrier Word Embedding via Phonetics and Writing
TLDR
This work proposes double-carrier word embedding (DCWE), which can be applied to most languages; Chinese, English, and Spanish are selected as examples, and the models are evaluated through word similarity and text classification experiments.
Multimodal Grounding for Language Processing
TLDR
The information flow in multimodal processing is categorized into a methodological inventory with respect to cognitive models of human information processing, and different methods for combining multimodal representations are analyzed.
A Game Interface to Study Semantic Grounding in Text-Based Models
TLDR
If any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of reach for text-based models; early work is presented on an online game for collecting human judgments of the distributional similarity of word pairs in five languages.

References

SHOWING 1-10 OF 65 REFERENCES
Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception
TLDR
This work examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, including measuring conceptual similarity and relatedness, as well as a zero-shot learning task mapping between the linguistic and auditory modalities.
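The zero-shot cross-modal mapping mentioned above can be sketched as a regularised linear map learned from linguistic to auditory vectors on seen concepts and then used to rank candidate auditory vectors for unseen concepts. The matrices below are random placeholders; a real setup would plug in actual skip-gram and audio vectors.

```python
# Sketch of a zero-shot cross-modal mapping in the spirit described above:
# learn a linear (ridge) map from linguistic vectors to auditory vectors on
# seen concepts, then retrieve the nearest auditory vector for an unseen
# concept. Matrices are random placeholders for real embeddings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(2)
n_seen, n_unseen, d_text, d_audio = 200, 20, 300, 128

X_seen = rng.normal(size=(n_seen, d_text))       # linguistic vectors (seen)
Y_seen = rng.normal(size=(n_seen, d_audio))      # auditory vectors (seen)
X_unseen = rng.normal(size=(n_unseen, d_text))   # linguistic vectors (unseen)
Y_unseen = rng.normal(size=(n_unseen, d_audio))  # candidate auditory vectors

mapper = Ridge(alpha=1.0).fit(X_seen, Y_seen)    # linear cross-modal map
Y_pred = mapper.predict(X_unseen)

# For each unseen concept, rank candidate auditory vectors by cosine similarity.
ranks = (-cosine_similarity(Y_pred, Y_unseen)).argsort(axis=1)
print("top-1 accuracy:", np.mean(ranks[:, 0] == np.arange(n_unseen)))
```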
Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset.
Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean
TLDR
This work presents a new means of extending the scope of multi-modal models to more commonly-occurring abstract lexical concepts via an approach that learns multimodal embeddings, and outperforms previous approaches in combining input from distinct modalities.
Combining Language and Vision with a Multimodal Skip-gram Model
TLDR
Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.
Sound-Word2Vec: Learning Word Representations Grounded in Sounds
TLDR
This work treats sound as a first-class citizen, studying downstream textual tasks which require aural grounding, and proposes sound-word2vec, a new embedding scheme that learns specialized word embeddings grounded in sounds.
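A rough sketch of the sound-word2vec idea, under the assumption that it can be approximated as "cluster the audio, then nudge tag-word embeddings to predict the cluster of the clips they tag": all data below is synthetic, and the update rule is a plain softmax-regression gradient step, not the authors' released implementation.

```python
# Very small sketch of the sound-word2vec idea (not the authors' code):
# audio clips are clustered into "sound classes", and word embeddings for
# the clips' tags are nudged, by softmax-regression gradient steps, to
# predict the cluster of the clips they tag. All data below is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n_clips, d_audio, d_word, n_clusters, vocab = 300, 64, 50, 8, 100

clip_feats = rng.normal(size=(n_clips, d_audio))        # audio descriptors
clip_tags = rng.integers(0, vocab, size=(n_clips, 3))   # 3 tag words per clip

clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(clip_feats)

E = rng.normal(scale=0.1, size=(vocab, d_word))          # word embeddings
W = rng.normal(scale=0.1, size=(d_word, n_clusters))     # softmax weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.05
for _ in range(5):                                       # a few epochs
    for tags, c in zip(clip_tags, clusters):
        x = E[tags].mean(axis=0)                         # average tag embedding
        p = softmax(x @ W)                               # predicted cluster dist.
        grad_logits = p.copy()
        grad_logits[c] -= 1.0                            # softmax cross-entropy grad
        grad_x = W @ grad_logits                         # gradient w.r.t. x
        W -= lr * np.outer(x, grad_logits)               # update classifier
        E[tags] -= lr * grad_x / len(tags)               # nudge tag embeddings

print("updated embedding matrix:", E.shape)
```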
Sound-based distributional models
TLDR
The first results of the effort to build a perceptually grounded semantic model based on sound data collected from freesound.org show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based model and the combined model.
Grounding Semantics in Olfactory Perception
TLDR
This is the first work to evaluate semantic similarity on representations grounded in olfactory data, constructed through a novel bag-of-chemical-compounds model and assessed with standard evaluations for multi-modal semantics.
A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
TLDR
This work improves a two-dimensional multimodal version of Latent Dirichlet Allocation, presents a novel way to integrate visual features into the LDA model using unsupervised clusters of images, and provides two novel ways to extend the bimodal models to support three or more modalities.
Multimodal Distributional Semantics
TLDR
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Grounded Models of Semantic Representation
TLDR
Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
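One common way to uncover latent information shared among the textual and perceptual modalities is canonical correlation analysis; the sketch below uses scikit-learn's CCA on random placeholder matrices purely to illustrate the idea, without claiming it is the latent model used in the cited work.

```python
# Sketch of the "shared latent information" idea: project paired text and
# perceptual matrices into a common space with canonical correlation
# analysis, then use the shared components as grounded representations.
# All matrices here are random placeholders for real concept vectors.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
n_concepts, d_text, d_percept, d_latent = 150, 120, 80, 20

X_text = rng.normal(size=(n_concepts, d_text))       # textual vectors
X_percept = rng.normal(size=(n_concepts, d_percept)) # perceptual vectors

cca = CCA(n_components=d_latent, max_iter=500)
Z_text, Z_percept = cca.fit_transform(X_text, X_percept)

# Average the two projections to get one grounded representation per concept.
Z_shared = (Z_text + Z_percept) / 2.0
print(Z_shared.shape)  # (150, 20)
```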