Representation Learning for the Automatic Indexing of Sound Effects Libraries

Alison B. Ma and Alexander Lerch
Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models…

Towards Learning Universal Audio Representations

A holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains is introduced and a novel normalizer-free Slowfast NFNet is proposed to achieve state-of-the-art performance across all domains.

Contrastive Learning of Musical Representations

It is shown that CLMR's representations are transferable using out-of-domain datasets, indicating that the method has strong generalisability in music classification. To foster reproducibility and future research on self-supervised learning in music, the models and source code are publicly released.

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.

Contrastive Learning of General-Purpose Audio Representations

This work builds on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.

Who Calls The Shots? Rethinking Few-Shot Learning for Audio

A series of experiments leads to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, or support set selection criterion; the best choice depends on the expected application scenario.

Few-Shot Continual Learning for Audio Classification

This work introduces a few-shot continual learning framework for audio classification, in which a trained base classifier is continuously expanded to recognize novel classes from only a few labeled examples at inference time, enabling fast and interactive model updates by end-users with minimal human effort.

Unsupervised Contrastive Learning of Sound Event Representations

This work proposes the pretext task of contrasting differently augmented views of sound events, and its results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

This paper demonstrates the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling, and shows that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks.
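
The combination of pseudo-labeling and consistency regularization named in this summary can be sketched for the unlabeled part of a batch as follows. This is a minimal NumPy illustration, not the FixMatch authors' code: the function name `fixmatch_unlabeled_loss`, the argument names, and the default threshold of 0.95 are assumptions chosen for the example, and the logits are assumed to come from the same model applied to a weakly and a strongly augmented view of each unlabeled sample.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Hypothetical helper: unlabeled loss term for one batch.

    weak_logits, strong_logits: arrays of shape (batch, classes).
    Returns the loss and the boolean mask of samples that were used.
    """
    weak_probs = np.exp(log_softmax(weak_logits))
    # Pseudo-labeling: take the predicted class on the weak view.
    pseudo_labels = weak_probs.argmax(axis=1)
    # Keep only samples whose prediction is confident enough.
    mask = weak_probs.max(axis=1) >= threshold
    # Consistency: cross-entropy of the strong view against the pseudo-label.
    strong_log_probs = log_softmax(strong_logits)
    ce = -strong_log_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    # Average over the whole batch, so rejected samples contribute zero.
    return float((ce * mask).mean()), mask
```

In a training loop this term would typically be added, with a weighting factor, to the ordinary supervised cross-entropy on the labeled samples.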

Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix

This work proposes different deep learning architectures with a Siamese structure and a Pairwise Presence Matrix for sound event retrieval, aimed at finding audio samples similar to an audio query based on their acoustic or semantic content.

Supervised Contrastive Learning

A novel training methodology that consistently outperforms cross entropy on supervised learning tasks across different architectures and data augmentations is proposed. It modifies the batch contrastive loss, which has recently been shown to be very effective at learning powerful representations in the self-supervised setting.
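
To make the idea of a label-aware batch contrastive loss concrete, the sketch below computes one common form of it in NumPy: every other sample with the same label as the anchor counts as a positive, and all other samples in the batch form the denominator. The function name `supcon_loss`, the temperature default of 0.1, and the choice to skip anchors without any positive are assumptions made for this illustration, not details taken from the summary above.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Hypothetical helper: supervised contrastive loss for one batch.

    embeddings: array of shape (batch, dim); labels: int array of shape (batch,).
    Assumes at least one anchor has another sample with the same label.
    """
    # L2-normalize so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Exclude each anchor's similarity with itself from the softmax.
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_denom = sim_max + np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # Positives: same label as the anchor, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    # Mean negative log-probability over each anchor's positives.
    summed = -np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = summed[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())
```

When the embeddings of same-label samples lie close together, the loss is small; assigning labels that cut across the clusters makes it large, which is the behavior the loss is meant to reward and penalize.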