Audio Retrieval with Natural Language Queries

@inproceedings{Oncescu2021AudioRW,
  title={Audio Retrieval with Natural Language Queries},
  author={Andreea-Maria Oncescu and A. Sophia Koepke and Jo{\~a}o F. Henriques and Zeynep Akata and Samuel Albanie},
  booktitle={Interspeech},
  year={2021}
}
We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AUDIOCAPS and CLOTHO datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our…

Audio Retrieval with Natural Language Queries: A Benchmark Study

This work employs three challenging new benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, and introduces the SOUNDDESCS benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AUDIOCAPS and CLOTHO.

Audio Retrieval with WavText5K and CLAP Training

This work introduces a new collection of web audio-text pairs and a new retrieval framework that learns to connect language and audio content using a text encoder, two audio encoders, and a contrastive learning objective.

Audio-Text Retrieval in Context

This work uses pre-trained audio features and a descriptor-based aggregation method to build a contextual audio-text retrieval system and observes that semantic mapping is more important than temporal relations in contextual retrieval.

Improving Natural-Language-Based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

This work uses pretrained embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close, and shows that the augmentation strategies used reduce overfitting and improve retrieval performance.

Language-Based Audio Retrieval Task in DCASE 2022 Challenge

The outcome of Subtask 6B, which focuses on ranking audio signals according to their relevance to natural language textual captions, is presented in terms of submitted systems' performance and analysis.

Automated Audio Captioning and Language-Based Audio Retrieval

This project involved participation in the DCASE 2022 Competition (Task 6), which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The architecture for Automated Audio Captioning performs close to the baseline, while the model for Language-Based Audio Retrieval has surpassed its counterpart.

Contrastive Audio-Language Learning for Music

This work proposes MusCALL, a framework for Music Contrastive Audio-Language Learning, a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box.

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

This work proposes a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio by introducing auxiliary textual information that describes the difference between the query and target audio.

Separate What You Describe: Language-Queried Audio Source Separation

This paper proposes LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information and to separate the target source that is consistent with the language query from an audio mixture.

Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity

This paper proposes a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC, and shows that it performs better than existing metrics used in the NL text and image captioning literature.

References

Large-scale content-based audio retrieval from text queries

A machine learning approach for retrieving sounds that is novel in that it uses free-form text queries rather than sound-sample-based queries, searches by audio content rather than via textual metadata, and can scale to a very large number of audio documents and a very rich query vocabulary.

Cross Modal Audio Search and Retrieval with Joint Embeddings Based on Text and Audio

This work proposes a framework that learns joint embeddings from a shared lexico-acoustic space, where vectors from either modality can be mapped together and compared directly. The framework improves semantic knowledge and enables the use of either text or audio queries to search and retrieve audio.

Use What You Have: Video retrieval using representations from collaborative experts

This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.

AudioCaps: Generating Captions for Audios in The Wild

A large-scale dataset of 46K audio clips with human-written text pairs collected via crowdsourcing on the AudioSet dataset is contributed and two novel components that help improve audio captioning performance are proposed: the top-down multi-scale encoder and aligned semantic attention.

QUERYD: A Video Dataset with High-Quality Text and Audio Narrations

The QuerYD dataset is introduced, a new large-scale dataset for retrieval and event localisation in video that is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Experimental results show that the proposed method succeeds in using a pre-trained language model for audio captioning, and that the oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.

Audio Caption: Listen and Tell

  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A manually-annotated dataset for audio caption is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and image.

Content-Based Representations of Audio Using Siamese Neural Networks

A novel approach is proposed which encodes the audio into a vector representation using Siamese Neural Networks to obtain an encoding similar for files belonging to the same audio class, thus allowing retrieval of semantically similar audio.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Audio-Visual-Based Query by Example Video Retrieval

The proposed two-step method for query-by-example video retrieval extracts a set of audio and visual features at the shot level and key-frame level and applies them to refine retrieval.
...