Who Calls The Shots? Rethinking Few-Shot Learning for Audio

  • Authors: Yu Wang, Nicholas J. Bryan, Justin Salamon, M. Cartwright, Juan Pablo Bello
  • Published: 17 October 2021
  • Venue: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, these advances have largely focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, and exhibits unique properties such as polyphony and varying signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have… 
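The support-set setup the abstract describes can be made concrete. Below is a minimal, hypothetical Python sketch (not from the paper) of sampling a standard n-way k-shot episode from a labeled pool; the function name and data layout are illustrative assumptions.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=5, n_query=5, seed=None):
    """Sample an n-way k-shot episode from a list of (example_id, class) pairs.

    Returns (support, query): dicts mapping class -> list of example ids.
    Classes with fewer than k_shot + n_query examples are skipped.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example_id, cls in labels:
        by_class[cls].append(example_id)
    # Only classes with enough examples for both support and query are eligible.
    eligible = [c for c, ids in by_class.items() if len(ids) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)
    support, query = {}, {}
    for c in classes:
        ids = rng.sample(by_class[c], k_shot + n_query)
        support[c] = ids[:k_shot]
        query[c] = ids[k_shot:]
    return support, query
```

In a multi-label audio setting, of the kind the paper studies, one example may carry several class labels, so a sampler like this is only the single-label starting point.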


Active Few-Shot Learning for Sound Event Detection

This work developed a novel dataset simulating the long-term temporal characteristics of sound events in real-world environmental soundscapes, and ran a series of experiments to explore the modeling and sampling choices that arise when combining few-shot learning and active learning, including different training schemes, sampling strategies, models, and temporal windows in sampling.

Leveraging Label Hierarchies for Few-Shot Everyday Sound Recognition

This work adopts a hierarchical prototypical network to leverage the knowledge rooted in audio taxonomies; experimental results demonstrate that it outperforms prototypical networks without hierarchy information and yields better results than other state-of-the-art algorithms.

Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

This work shows how to surpass traditional convolutional neural network architectures, and come strikingly close to outperforming powerful Transformer architectures, which would pave the way for exciting advancements in the field of representation learning without massive, end-to-end neural architectures.

Representation Learning for the Automatic Indexing of Sound Effects Libraries

It is shown that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3.

Urban Rhapsody: Large‐scale exploration of urban soundscapes

Urban Rhapsody is proposed, a framework that combines state‐of‐the‐art audio representation, machine learning and visual analytics to allow users to interactively create classification models, understand noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large high‐precision annotated database of urban sound recordings.

Few-Shot Drum Transcription in Polyphonic Music

This work addresses open vocabulary ADT by introducing few-shot learning to the task and shows that, given just a handful of selected examples at inference time, the model can match and in some cases outperform a state-of-the-art supervised ADT approach under a fixed vocabulary setting.

Few-Shot Continual Learning for Audio Classification

This work introduces a few-shot continual learning framework for audio classification, where a trained base classifier is continuously expanded to recognize novel classes from only a few labeled examples at inference time, enabling fast and interactive model updates by end-users with minimal human effort.

Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events

This paper aims to achieve few-shot detection of rare sound events in query sequences that contain not only the target events but also other events and background noise, and proposes metric learning with a background noise class for few-shot detection.

Few-Shot Sound Event Detection

This work adapts state-of-the-art metric-based few-shot learning methods to automate the detection of similar-sounding events, requiring only one or few examples of the target event, and develops a method to automatically construct a partial set of labeled examples to reduce user labeling effort.

A Closer Look at Few-shot Classification

The results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones, and a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.

Learning to Compare: Relation Network for Few-Shot Learning

A conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each, and which is easily extended to zero-shot learning.

Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?

It is shown that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods.
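The baseline this summary describes — a frozen representation plus a linear classifier fit on the support set — can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; the embeddings `z_support` are assumed to come from a pre-trained encoder.

```python
import numpy as np

def fit_linear_classifier(z_support, y_support, n_classes, lr=0.1, steps=200):
    """Fit a softmax linear classifier on frozen support embeddings
    (the few-shot baseline: the representation is fixed, only W and b are learned)."""
    n, d = z_support.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y_support]          # one-hot targets
    for _ in range(steps):
        logits = z_support @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / n                    # softmax cross-entropy gradient
        W -= lr * z_support.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(z, W, b):
    """Classify embeddings with the fitted linear head."""
    return np.argmax(z @ W + b, axis=1)
```

The point of the baseline is that all of the few-shot adaptation happens in this tiny linear head; the embedding network itself is never updated at test time.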

Matching Networks for One Shot Learning

This work employs ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories to learn a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types.

Prototypical Networks for Few-shot Learning

This work proposes Prototypical Networks for few-shot classification, and provides an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning.
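The inference step of Prototypical Networks is simple enough to show directly: each class prototype is the mean of its support embeddings, and queries are assigned to the nearest prototype. A minimal NumPy sketch (the embeddings are assumed to come from a trained encoder):

```python
import numpy as np

def prototypes(z_support, y_support, n_classes):
    """Class prototypes: the mean embedding of each class's support examples."""
    return np.stack([z_support[y_support == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(z_query, protos):
    """Assign each query embedding to its nearest prototype
    (squared Euclidean distance, as in the paper)."""
    d2 = ((z_query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

During training, the same nearest-prototype distances are turned into a softmax over classes and optimized episode by episode; the snippet above covers only the classification rule.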