Audio Barlow Twins: Self-Supervised Audio Representation Learning

Jonah Anton, Harry Coppock, Pancham Shukla, Björn Schuller
The Barlow Twins self-supervised learning objective requires neither negative samples nor asymmetric learning updates, achieving results on a par with the current state of the art in Computer Vision. As such, we present Audio Barlow Twins, a novel self-supervised audio representation learning approach that adapts Barlow Twins to the audio domain. We pre-train on the large-scale audio dataset AudioSet and evaluate the quality of the learnt representations on 18 tasks from the HEAR 2021…
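The method, like related audio SSL work, feeds two stochastically augmented views of the same clip to twin networks. As a minimal illustrative sketch only (the function name, crop length, and gain-jitter augmentation are assumptions, not the paper's actual augmentation pipeline), generating two views of a log-mel spectrogram might look like:

```python
import numpy as np

def two_views(spec, crop_frames=96, rng=None):
    """Produce two stochastically augmented views of a log-mel spectrogram
    (shape: mel_bins x frames). A random time-crop plus a random overall
    gain offset stand in for the real augmentation pipeline."""
    if rng is None:
        rng = np.random.default_rng()

    def view():
        # random crop along the time axis
        start = rng.integers(0, spec.shape[1] - crop_frames + 1)
        crop = spec[:, start:start + crop_frames]
        # random overall gain jitter (dB offset on a log spectrogram)
        return crop + rng.normal(0.0, 1.0)

    return view(), view()
```

Each call yields a different pair, which is what gives the twin networks distinct inputs to decorrelate.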




Contrastive Learning of General-Purpose Audio Representations

This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach.

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the importance of using different pre-training datasets to develop general-purpose audio representations.

Audio self-supervised learning: A survey

A Note on Connecting Barlow Twins with Negative-Sample-Free Contrastive Learning

Compared to the prior state-of-the-art SSL methods, Barlow Twins demonstrates two main properties: its algorithm requires no explicit construction of negative sample pairs, and is not sensitive to large training batch sizes.

CLAR: Contrastive Learning of Auditory Representations

By combining all these methods and with substantially less labeled data, the CLAR framework achieves a significant improvement in prediction performance compared to supervised approaches, and converges faster with significantly better representations.

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

This study hypothesizes that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound and proposes a self-supervised learning method, Bootstrap Your Own Latent for Audio (BYOL-A, pronounced “viola”), which makes the learned representations robust to the perturbations of sounds.

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

This work proposes an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible.
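The objective described above admits a compact sketch. The following is a minimal NumPy illustration of the Barlow Twins loss (the function name and the default off-diagonal weight `lam` are assumptions for illustration): standardise both batches of embeddings, form their cross-correlation matrix, and penalise its distance from the identity.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective: push the cross-correlation matrix of two
    batch-standardised embeddings towards the identity matrix.
    On-diagonal terms enforce invariance across views; off-diagonal
    terms reduce redundancy between embedding dimensions."""
    n, _ = z_a.shape
    # standardise each embedding dimension over the batch
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                       # d x d cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()   # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag
```

Feeding the same embeddings as both views drives the diagonal to exactly 1, leaving only a small off-diagonal penalty, which is why the loss needs no negative pairs to avoid collapse.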

Unsupervised Contrastive Learning of Sound Event Representations

This work proposes the pretext task of contrasting differently augmented views of sound events; the results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.

SSAST: Self-Supervised Audio Spectrogram Transformer

This paper proposes to pretrain the Audio Spectrogram Transformer (AST) with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech; it is the first patch-based self-supervised learning framework in the audio and speech domain, and the first self-supervised learning framework for AST.
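The masking step in patch-based pretraining of this kind can be sketched simply. This is an illustrative fragment only (the function name and masking ratio are assumptions, not SSAST's actual settings): choose a random subset of spectrogram-patch indices, which the model is then trained to discriminate and reconstruct.

```python
import numpy as np

def sample_masked_patches(n_patches, mask_frac=0.4, rng=None):
    """Select a random subset of spectrogram-patch indices to mask
    during masked-patch pretraining. Returns unique indices in
    [0, n_patches)."""
    if rng is None:
        rng = np.random.default_rng()
    n_mask = int(n_patches * mask_frac)
    return rng.choice(n_patches, size=n_mask, replace=False)
```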