Towards Learning Universal Audio Representations

  @article{Wang2021TowardsLU,
    title={Towards Learning Universal Audio Representations},
    author={Luyu Wang and Pauline Luc and Yan Wu and Adri{\`a} Recasens and Lucas Smaira and Andrew Brock and Andrew Jaegle and Jean-Baptiste Alayrac and Sander Dieleman and Jo{\~a}o Carreira and A{\"a}ron van den Oord},
    journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year={2022}
  }

  • Published 23 November 2021
  • Computer Science
The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models… 
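Benchmarks of this kind typically score a representation by freezing the pretrained encoder and training only a lightweight classifier on its embeddings. As an illustrative sketch of that linear-probe protocol (not the authors' exact setup; all names and hyperparameters here are hypothetical), in NumPy:

```python
import numpy as np

def linear_probe_accuracy(train_emb, train_y, test_emb, test_y,
                          lr=0.1, steps=500):
    """Train a softmax classifier on frozen embeddings and report
    test accuracy -- the encoder itself is never updated."""
    n_classes = int(train_y.max()) + 1
    W = np.zeros((train_emb.shape[1], n_classes))
    onehot = np.eye(n_classes)[train_y]
    for _ in range(steps):
        logits = train_emb @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # gradient of the cross-entropy loss w.r.t. W
        grad = train_emb.T @ (probs - onehot) / len(train_y)
        W -= lr * grad
    preds = (test_emb @ W).argmax(axis=1)
    return float((preds == test_y).mean())
```

Averaging such probe scores across many tasks then gives a single number per representation.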

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

This study hypothesizes that representations effective for general audio tasks should capture multiple robust aspects of the input sound, and proposes Bootstrap Your Own Latent for Audio (BYOL-A, pronounced "viola"), a self-supervised learning method that makes the learned representations robust to perturbations of sounds.

Decorrelating Feature Spaces for Learning General-Purpose Audio Representations

This paper proposes learning embeddings that are invariant to distortions of an input audio sample while still containing non-redundant information about it, so that a network trained in a resource-constrained setting learns representations that generalize well across a diverse set of downstream tasks.
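A decorrelation objective of this kind can be written in the Barlow-Twins style: push the cross-correlation matrix of two augmented views' embeddings toward the identity, so the diagonal enforces invariance and the off-diagonal removes redundancy. A minimal NumPy sketch (an illustration of the idea, not the paper's exact loss or coefficients):

```python
import numpy as np

def decorrelation_loss(z1, z2, lam=5e-3):
    """z1, z2: (batch, dim) embeddings of two augmented views."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)   # standardize per dim
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / z1.shape[0]                    # cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()      # invariance: diagonal -> 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy: off-diag -> 0
    return on_diag + lam * off_diag
```

When the two views carry the same content, the diagonal is near 1 and the loss is small; uncorrelated views are penalized.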

HEAR: Holistic Evaluation of Audio Representations

The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios, including speech, environmental sound, and music.

HEAR 2021: Holistic Evaluation of Audio Representations

Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies.

Supervised and Unsupervised Learning of Audio Representations for Music Understanding

This work shows that models trained via supervised learning on large-scale, expert-annotated music datasets achieve state-of-the-art performance on a wide range of music labelling tasks, each with novel content and vocabularies; restricting the pre-training dataset to the music domain allows training with smaller batch sizes.

Audio self-supervised learning: A survey

SLICER: Learning universal audio representations using low-resource self-supervised pre-training

This work proposes SLICER, a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification and achieves state-of-the-art results on the LAPE Benchmark.

Learning Music Representations with wav2vec 2.0

The results show that wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations.

Learning neural audio features without supervision

First, it is shown that pretraining two previously proposed frontends (SincNet and LEAF) on AudioSet drastically improves linear-probe performance over mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training even more than supervised training.

Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

Contrastive Learning of General-Purpose Audio Representations

This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.
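The contrastive objective in this line of work is typically an InfoNCE loss over a batch: each anchor embedding must identify the embedding of another segment from the same clip among all other batch items. A simplified NumPy sketch (using cosine similarity and a hypothetical temperature, not necessarily the paper's exact similarity function):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """anchors, positives: (batch, dim); positives[i] pairs with anchors[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the i-th anchor's true positive sits on the diagonal
    return -np.mean(np.diag(log_prob))
```

Every other item in the batch serves as a free negative, which is what keeps such methods lightweight.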

Contrastive Learning of Musical Representations

It is shown that CLMR's representations transfer to out-of-domain datasets, indicating that the method generalizes strongly in music classification; to foster reproducibility and future research on self-supervised learning in music, the models and source code are publicly released.

Pre-Training Audio Representations With Self-Supervision

This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
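Training data for a TemporalGap-style task can be generated by cutting two non-overlapping segments from one clip and regressing the (normalized) time between them. A hypothetical sketch of the sampling step, assuming 1-D clips as NumPy arrays:

```python
import numpy as np

def sample_temporal_gap_pair(clip, segment_len, rng):
    """Cut two non-overlapping segments from one clip; the normalized
    gap between them is the self-supervised regression target."""
    n = len(clip)
    max_start = n - 2 * segment_len            # room for both segments
    s1 = rng.integers(0, max_start + 1)        # start of the first segment
    gap = rng.integers(0, max_start - s1 + 1)  # samples between the segments
    s2 = s1 + segment_len + gap                # start of the second segment
    seg_a = clip[s1:s1 + segment_len]
    seg_b = clip[s2:s2 + segment_len]
    return seg_a, seg_b, gap / n               # target in [0, 1)
```

A small network then predicts the target from the two segments' embeddings.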

Contrastive Predictive Coding of Audio with an Adversary

This work investigates learning general audio representations directly from raw signals using the Contrastive Predictive Coding objective, and extends it by leveraging ideas from adversarial machine learning to produce additive perturbations that effectively make the learning harder, so that the predictive tasks are not distracted by trivial details.

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach.

Multi-Format Contrastive Learning of Audio Representations

This work investigates the use of the contrastive learning framework to learn audio representations by maximizing the agreement between the raw audio and its spectral representation and finds a significant gain using this multi-format strategy against the single-format counterparts.

Towards Learning a Universal Non-Semantic Representation of Speech

This paper proposes a benchmark for comparing speech representations on non-semantic tasks, along with a representation based on an unsupervised triplet-loss objective that outperforms other representations on the benchmark and even exceeds state-of-the-art performance on a number of transfer learning tasks.
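A triplet objective of this general kind pulls an anchor embedding toward a "positive" (e.g. a temporally nearby segment from the same clip) and pushes it away from a "negative" drawn from elsewhere. A minimal sketch of the standard margin formulation (the margin value here is hypothetical):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor, positive, negative: (batch, dim) embeddings.
    Penalizes triplets where the positive is not closer to the
    anchor than the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # anchor-negative distance
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))
```

Temporal proximity thus supplies the supervision signal without any labels.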

SUPERB: Speech processing Universal PERformance Benchmark

A simple framework solves SUPERB tasks by learning task-specialized lightweight prediction heads on top of a frozen, shared model, favoring re-usability; results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks.

Unsupervised Learning of Semantic Audio Representations

  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.

End-to-end Learning for Music Audio Tagging at Scale

This work studies how waveform-based and spectrogram-based models compare when datasets of variable size are available for training: waveform-based models outperform spectrogram-based ones in large-scale data scenarios, while music domain assumptions become relevant when not enough training data are available.