BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada and Kunio Kashino. 2021 International Joint Conference on Neural Networks (IJCNN).
Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”), an audio… 
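The core idea — a slow-moving target network supervising an online network on two augmented views of the *same* audio segment, with no negative pairs — can be illustrated with a minimal NumPy sketch. This is an illustrative toy, not the paper's implementation: the linear `encoder`, the noise-based "views" (stand-ins for BYOL-A's spectrogram augmentations), and all shapes are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # toy linear "encoder": maps a feature vector (e.g. a flattened
    # log-mel patch) to an embedding
    return np.tanh(W @ x)

def normalized_mse(p, z):
    # BYOL-style regression loss: MSE between L2-normalized online
    # prediction and target embedding (equals 2 - 2 * cosine similarity)
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return float(np.sum((p - z) ** 2))

def ema_update(W_target, W_online, tau=0.99):
    # the target network is an exponential moving average of the online
    # network; it is never updated by gradients
    return tau * W_target + (1 - tau) * W_online

# two augmented "views" of one audio segment (noise stands in for
# mixup / random resize-crop on a log-mel spectrogram)
x = rng.normal(size=64)
view1 = x + 0.1 * rng.normal(size=64)
view2 = x + 0.1 * rng.normal(size=64)

W_online = rng.normal(scale=0.1, size=(16, 64))
W_target = W_online.copy()

loss = normalized_mse(encoder(view1, W_online), encoder(view2, W_target))
W_target = ema_update(W_target, W_online)
```

In the real method the online branch also has a predictor head, and the loss is symmetrized over the two views; the sketch keeps only the bootstrapping structure.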


One Billion Audio Sounds from GPU-Enabled Modular Synthesis

A multi-modal audio corpus of 1 billion 4-second synthesized sounds, 100x larger than any audio dataset in the literature, is introduced, along with novel approaches to synthesizer hyperparameter optimization.

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the importance of using different pre-training datasets to develop general-purpose audio representations.

FunnyNet: Audiovisual Learning of Funny Moments in Videos

Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as facial expression, body language,

Efficient Speech Quality Assessment using Self-supervised Framewise Embeddings

This paper proposes an efficient system with results comparable to the best-performing model in the ConferencingSpeech 2022 challenge, characterized by fewer parameters, fewer FLOPS, lower memory consumption, and lower latency, contributing to sustainable machine learning.

SLICER: Learning universal audio representations using low-resource self-supervised pre-training

This work proposes SLICER, a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification and achieves state-of-the-art results on the LAPE Benchmark.

Audio Barlow Twins: Self-Supervised Audio Representation Learning

Audio Barlow Twins is presented, a novel self-supervised audio representation learning approach that adapts Barlow Twins to the audio domain; the quality of the learnt representations is evaluated on 18 tasks from the HEAR 2021 Challenge.
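The Barlow Twins objective that this work adapts pushes the cross-correlation matrix between embeddings of two augmented views toward the identity: diagonal terms enforce invariance, off-diagonal terms reduce redundancy. A minimal NumPy sketch of the loss, with assumed batch/embedding sizes and the common default weighting:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    # z1, z2: (batch, dim) embeddings of two augmented views
    n, d = z1.shape
    # standardize each embedding dimension over the batch
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    c = z1.T @ z2 / n                 # (dim, dim) cross-correlation matrix
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return float(on_diag + lam * off_diag)

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
# identical views: diagonal of c is exactly 1, so only a small
# off-diagonal penalty from chance correlations remains
loss_identical = barlow_twins_loss(z, z)
```

With perfectly decorrelated, identical views the loss would be exactly zero; in practice both terms are minimized jointly through the encoder.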

ATST: Audio Representation Learning with Teacher-Student Transformer

This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves new state-of-the-art results on almost all of the downstream tasks.

Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load

A novel self-supervised audio representation is designed and evaluated that leverages the effectiveness of handcrafted (DSP-based) features and the complexity of data-driven DNN representations, and outperforms both extensive handcrafted feature sets and novel DNN-based audio representation learning approaches.

DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

This paper introduces two new general-purpose audio representation learning approaches, DeLoRes-S and DeLoRes-M, and proposes to learn embeddings that are invariant to distortions of an input audio sample while ensuring that they contain non-redundant information about the sample.

Towards Learning Universal Audio Representations

A holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains is introduced and a novel normalizer-free Slowfast NFNet is proposed to achieve state-of-the-art performance across all domains.

Contrastive Learning of General-Purpose Audio Representations

This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.

Multi-Task Self-Supervised Learning for Robust Speech Recognition

PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

The results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with the method correlate well with some acoustic descriptors.

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
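The contrastive loss this framework popularized, NT-Xent (normalized temperature-scaled cross-entropy), treats the two augmented views of each sample as a positive pair and all other samples in the batch as negatives. A minimal NumPy sketch, with the batch layout (rows `i` and `i+N` are views of the same sample) and temperature chosen for illustration:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    # z: (2N, dim) embeddings; rows i and i+N are two views of sample i
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    n = z.shape[0] // 2
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    # index of each row's positive partner (its other view)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy of the positive against all non-self pairs
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
# two slightly perturbed "views" per sample, as a stand-in for augmentation
views = np.vstack([base + 0.01 * rng.normal(size=(4, 8)),
                   base + 0.01 * rng.normal(size=(4, 8))])
loss = nt_xent(views)
```

The paper's observation that larger batches help follows directly from this form: more rows mean more negatives in each row's denominator.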

Unsupervised Learning of Semantic Audio Representations

  • A. Jansen, M. Plakal, R. Saurous
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.

i-Mix: A Strategy for Regularizing Contrastive Representation Learning

It is demonstrated that i-Mix consistently improves the quality of self-supervised representations across domains, resulting in significant performance gains on downstream tasks, and its regularization effect is confirmed via extensive ablation studies across model and dataset sizes.
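The core mechanism of i-Mix is input mixup applied to contrastive learning: each sample in a batch is blended with a randomly chosen partner, and the mixing coefficient also softens the virtual labels used by the contrastive loss. A minimal NumPy sketch of the mixing step only (the loss side is omitted); the Beta(1, 1) prior and batch shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def i_mix_inputs(batch, alpha=1.0, rng=rng):
    # Blend each sample with a permuted partner; lam ~ Beta(alpha, alpha).
    # The same lam would weight the two samples' virtual labels in the
    # contrastive loss, turning hard targets into soft ones.
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(batch))
    mixed = lam * batch + (1 - lam) * batch[perm]
    return mixed, perm, lam

batch = rng.normal(size=(8, 16))   # e.g. 8 spectrogram feature vectors
mixed, perm, lam = i_mix_inputs(batch)
```

The regularization effect the summary mentions comes from the model never seeing a clean, single-identity input, analogous to mixup in supervised training.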

Self-supervised Learning: Generative or Contrastive

This survey takes a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning, and comprehensively reviews the existing empirical methods in three main categories according to their objectives.

Unsupervised Contrastive Learning of Sound Event Representations

This work proposes the pretext task of contrasting differently augmented views of sound events, and suggests that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.