Cross Modal Audio Search and Retrieval with Joint Embeddings Based on Text and Audio

@article{Elizalde2019CrossMA,
  title={Cross Modal Audio Search and Retrieval with Joint Embeddings Based on Text and Audio},
  author={Benjamin Elizalde and Shuayb Zarar and Bhiksha Raj},
  journal={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={4095-4099}
}
  • Benjamin Elizalde, Shuayb Zarar, B. Raj
  • Published 29 April 2019
  • Computer Science
Existing audio search engines use one of two approaches: matching text-text or audio-audio pairs. In the former, text queries are matched to semantically similar words in an index of audio metadata to retrieve corresponding audio clips or segments, while in the latter, audio signals are directly used to retrieve acoustically-similar recordings from an audio database. However, independent treatment of text and audio has precluded information exchange between the two modalities. This is a problem… 
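
As a hedged illustration of the joint-embedding idea sketched in the abstract, the snippet below maps pre-computed text and audio features into a shared space with two small encoders and ranks an indexed audio collection by cosine similarity to a text query; the module names, dimensions, and layer choices are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch of cross-modal text-to-audio retrieval in a joint embedding
# space (assumes PyTorch); all names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Projects a pre-computed text feature (e.g. averaged word vectors) into the joint space."""
    def __init__(self, text_dim=300, joint_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, joint_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit norm, so dot product = cosine similarity

class AudioEncoder(nn.Module):
    """Projects a pre-computed audio feature (e.g. pooled log-mel statistics) into the joint space."""
    def __init__(self, audio_dim=128, joint_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, joint_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def retrieve(text_query, audio_index, text_enc, audio_enc, k=5):
    """Return indices of the top-k indexed audio clips for a single text query."""
    with torch.no_grad():
        q = text_enc(text_query)       # (1, joint_dim)
        a = audio_enc(audio_index)     # (N, joint_dim)
        scores = q @ a.t()             # (1, N) cosine similarities
        return scores.topk(k, dim=-1).indices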

Citations

Audio Retrieval with Natural Language Queries: A Benchmark Study
TLDR
This work employs three challenging new benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, and introduces the SOUNDDESCS benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AUDIOCAPS and CLOTHO.
Unsupervised Audio-Caption Aligning Learns Correspondences Between Individual Sound Events and Textual Phrases
TLDR
Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.
Audio Retrieval with Natural Language Queries
TLDR
This work introduces challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AUDIOCAPS and CLOTHO datasets and employs these benchmarks to establish baselines for cross-modal audio retrieval, where the benefits of pre-training on diverse audio tasks are demonstrated (a minimal recall@k evaluation sketch appears after this list of citing papers).
Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix
TLDR
This work proposes different Deep Learning architectures with a Siamese structure and a Pairwise Presence Matrix for sound event retrieval, aimed at finding audio samples similar to an audio query based on their acoustic or semantic content.
Learning Audio-Video Modalities from Image Captions
TLDR
A new video mining pipeline is proposed which involves transferring captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
TLDR
An Audio-Grounding dataset is contributed, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each present sound event.
Emotion Embedding Spaces for Matching Music to Stories
TLDR
The goal is to help creators find music to match the emotion of their story by leveraging data-driven embeddings on text-based stories that can be auralized, use multiple sentences as input queries, and automatically retrieve matching music.
T-EMDE: Sketching-based global similarity for cross-modal retrieval
TLDR
T-EMDE is a drop-in replacement for the self-attention module, with beneficial influence on both speed and metric performance in cross-modal settings, as each global text/image representation is expressed with a standardized sketch histogram which represents the same manifold structures irrespective of the underlying modality.
Semantically Meaningful Attributes from Co-Listen Embeddings for Playlist Exploration and Expansion
TLDR
This work examines the relative performance of these two embedding spaces (the co-listen–audio embedding and the attribute embedding) for the mathematical separation of thematic playlists and reports on the usefulness of recommendations from the attribute embedding space to human curators for automatically extending thematic playlists.
Multimodal Matching Transformer for Live Commenting
TLDR
This work proposes a multimodal matching transformer to capture the relationships among comments, vision, and audio based on the transformer framework and can iteratively learn the attention-aware representations for each modality.
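
The retrieval benchmarks among the citing papers above are typically evaluated with recall@k over paired text and audio embeddings; the snippet below is a minimal, hedged sketch of that standard metric rather than code from any listed paper, and it assumes row i of the text matrix is paired with row i of the audio matrix and that both matrices are L2-normalized.

# Hedged sketch of recall@k for text-to-audio retrieval; the pairing and
# normalization conventions are assumptions, not taken from the cited works.
import numpy as np

def recall_at_k(text_emb: np.ndarray, audio_emb: np.ndarray, k: int = 10) -> float:
    """Fraction of text queries whose paired audio clip ranks in the top-k."""
    scores = text_emb @ audio_emb.T                    # (N, N) cosine similarities
    ranks = (-scores).argsort(axis=1)                  # per-query ranking of audio indices
    hits = (ranks[:, :k] == np.arange(len(scores))[:, None]).any(axis=1)
    return float(hits.mean())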

References

SHOWING 1-10 OF 20 REFERENCES
Large-scale content-based audio retrieval from text queries
TLDR
A machine learning approach for retrieving sounds that is novel in that it uses free-form text queries rather than sound-sample-based queries, searches by audio content rather than via textual metadata, and can scale to a very large number of audio documents and a very rich query vocabulary.
Content-Based Representations of Audio Using Siamese Neural Networks
TLDR
A novel approach is proposed which encodes audio into a vector representation using Siamese Neural Networks so that files belonging to the same audio class obtain similar encodings, thus allowing retrieval of semantically similar audio (a minimal contrastive-loss sketch appears after the references shown here).
An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges
TLDR
An overview of cross-media retrieval is given, including the concepts, methodologies, major challenges, and open issues, as well as building up the benchmarks, including data sets and experimental results so that researchers can directly adopt the benchmarks to promptly evaluate their proposed methods.
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN-based approach for weakly supervised training of audio events, identifies important characteristics that naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Segmentation, Indexing, and Retrieval for Environmental and Natural Sounds
TLDR
A dynamic Bayesian network (DBN) is presented that jointly infers onsets and end times of the most prominent sound events in the space, along with an extension of the algorithm for covering large spaces with distributed microphone arrays.
See, Hear, and Read: Deep Aligned Representations
TLDR
This work utilizes large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language, and jointly trains a deep convolutional network for aligned representation learning.
Learning Deep Structure-Preserving Image-Text Embeddings
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is…
NELS - Never-Ending Learner of Sounds
TLDR
This work introduces the Never-Ending Learner of Sounds (NELS), a project for continuously learning of sounds and their associated knowledge, and proposes a system that continuously learns from the web relations between sounds and language.
Opensmile: the munich versatile and fast open-source audio feature extractor
TLDR
The openSMILE feature extraction toolkit is introduced, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities and has a modular, component based architecture which makes extensions via plug-ins easy.
Multimodal Similarity-Preserving Hashing
TLDR
An efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable, based on a novel coupled siamese neural network architecture.
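
As noted under "Content-Based Representations of Audio Using Siamese Neural Networks" above, the snippet below is a hedged sketch of a shared-weight Siamese encoder trained with a contrastive loss; the architecture, margin, and feature dimensions are illustrative assumptions, not the implementation of any listed reference.

# Hedged sketch of a Siamese audio encoder with a contrastive loss (assumes
# PyTorch); hyper-parameters and layer sizes are assumptions for illustration.
import torch.nn as nn
import torch.nn.functional as F

class SiameseAudioEncoder(nn.Module):
    def __init__(self, feat_dim=128, emb_dim=64):
        super().__init__()
        # A single encoder applied to both inputs, i.e. shared weights.
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, a, b):
        return self.net(a), self.net(b)

def contrastive_loss(za, zb, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs beyond a margin.

    same_class: float tensor with 1.0 for same-class pairs and 0.0 otherwise.
    """
    d = F.pairwise_distance(za, zb)
    pos = same_class * d.pow(2)
    neg = (1.0 - same_class) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()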