Audio-Based Distributional Representations of Meaning Using a Fusion of Feature Encodings

Giannis Karamanolakis, Elias Iosif, Athanasia Zlatintsi, Aggelos Pikrakis, Alexandros Potamianos
Recently a “Bag-of-Audio-Words” approach was proposed [1] for the combination of lexical features with audio clips in a multimodal semantic representation, i.e., an Audio Distributional Semantic Model (ADSM). An important step towards the creation of ADSMs is the estimation of the semantic distance between clips in the acoustic space, which is especially challenging given the diversity of audio collections. In this work, we investigate the use of different feature encodings in order to address… 


Audio-based Distributional Semantic Models for Music Auto-tagging and Similarity Measurement
Acoustic-semantic models are shown to outperform the state-of-the-art for this task and produce high quality tags for audio/music clips.
Sensory-Aware Multimodal Fusion for Word Semantic Similarity Estimation
This work estimates multimodal word representations by fusing the auditory and visual modalities with the text modality, using middle and late fusion with a weight assigned to each unimodal representation.
Analysis of Song/Artist Latent Features and Its Application for Song Search
This paper proposes two concepts of artist-song relationships, overall similarity and prominent affinity, and three song-search applications that accommodate users' various search intents.
Query-by-Blending: A Music Exploration System Blending Latent Vector Representations of Lyric Word, Song Audio, and Artist
Query-by-Blending is a novel music exploration system that lets users find unfamiliar music content by flexibly combining three musical aspects (lyric word, song audio, and artist) within a single vector space model.


Sound-based distributional models
The first results of the efforts to build a perceptually grounded semantic model based on collected sound data show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based model and the combined model.
Bag-of-Audio-Words Approach for Multimedia Event Classification
Variations of the BoAW method are explored, and results on the NIST 2011 multimedia event detection (MED) dataset are presented.
Multi-Tasking with Joint Semantic Spaces for Large-Scale Music Annotation and Retrieval
A method is proposed which attempts to capture the semantic similarities between the database items by modelling audio, artist names, and tags in a single low-dimensional semantic embedding space by optimizing the set of prediction tasks of interest jointly using multi-task learning.
Coherent bag-of audio words model for efficient large-scale video copy detection
This paper tackles the video copy detection task using audio information, which is as important as visual information in multimedia processing, and proposes a bag-of audio words (BoA) representation to characterize each audio frame.
Audio retrieval by latent perceptual indexing
A query-by-example audio retrieval framework that indexes audio clips in a generic database as points in a latent perceptual space; the results show that the system performance is comparable to other proposed methods.
Mixtures of probability experts for audio retrieval and indexing
M. Slaney. Proceedings. IEEE International Conference on Multimedia and Expo, 2002.
This paper describes the conversion of audio and semantic data into their respective vector spaces; two different mixture-of-probability-expert models are then trained to learn the association between acoustic queries and the corresponding semantic explanation.
Automatically Adapting the Structure of Audio Similarity Spaces
The results show that the proposed techniques clearly improve the quality of this audio similarity measure, and preliminary experiments indicate that the techniques also help to improve other similarity measures.
Audio Information Retrieval using Semantic Similarity
We improve upon query-by-example for content-based audio information retrieval by ranking items in a database based on semantic similarity, rather than acoustic similarity, to a query example.
Multimodal Distributional Semantics
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Feature Selection and Stacking for Robust Discrimination of Speech, Monophonic Singing, and Polyphonic Music
In this work we strive to find an optimal set of acoustic features for the discrimination of speech, monophonic singing, and polyphonic music to robustly segment acoustic media streams for annotation