Self-Supervised Learning from Automatically Separated Sound Scenes

  • Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra
  • Published 5 May 2021
  • Computer Science
  • 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound… 
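The idea the abstract leads up to can be sketched generically: treat the separated channels of an unlabeled mixture as semantically related views of that recording, and pull their embeddings together with a contrastive objective. The sketch below is an illustration under that reading, not the authors' implementation; all function and variable names are our own assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mixture_channel_contrastive_loss(mix_emb, chan_emb, temperature=0.1):
    """Generic contrastive objective: each mixture embedding should be
    closer to the embeddings of its own separated channels than to
    channels separated from other mixtures in the batch.

    mix_emb:  (B, D)    one embedding per unlabeled mixture
    chan_emb: (B, C, D) C separated-channel embeddings per mixture
    """
    B, C, D = chan_emb.shape
    m = l2_normalize(mix_emb)                      # (B, D)
    ch = l2_normalize(chan_emb.reshape(B * C, D))  # (B*C, D)
    logits = m @ ch.T / temperature                # (B, B*C)
    # Log-softmax over all channels in the batch.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives for mixture i are its own C channels.
    loss = 0.0
    for i in range(B):
        loss -= log_prob[i, i * C:(i + 1) * C].mean()
    return loss / B
```

In practice the embeddings would come from an encoder applied to log-mel spectrograms of the mixture and of each separated channel; here they are just arrays.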

Figures and Tables from this paper

Unsupervised Source Separation via Self-Supervised Training
We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth source signals.
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
This paper combines the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient Conformer architectures, and proposes a self-supervised audio representation learning method that achieves a new state-of-the-art score on the AudioSet benchmark.
Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment
This work presents a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC).
FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.


Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled non-speech audio, and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Unsupervised Sound Separation Using Mixture Invariant Training
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
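The core MixIT step is compact enough to sketch: sum two reference mixtures, let the separator produce several estimated sources, then score the best binary assignment of those sources back to the two references. This toy version uses plain MSE in place of the paper's SNR-based loss and exhaustive search over assignments; it is an illustration, not the reference implementation.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Mixture invariant training (MixIT) loss, sketched with MSE.

    est_sources: (M, T) separator outputs for the mixture-of-mixtures
                 mix1 + mix2
    mix1, mix2:  (T,)   the two reference mixtures

    Searches all 2^M binary assignments of estimated sources to the
    two references and returns the best reconstruction error.
    """
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        a = np.asarray(assign)
        rec1 = est_sources[a == 0].sum(axis=0)  # sources assigned to mix1
        rec2 = est_sources[a == 1].sum(axis=0)  # sources assigned to mix2
        err = np.mean((rec1 - mix1) ** 2) + np.mean((rec2 - mix2) ** 2)
        best = min(best, err)
    return best
```

Because only mixtures are needed as targets, the separator can be trained on unlabeled single-channel recordings, which is what makes MixIT usable as a preprocessing step for the sound-scene decomposition above.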
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
  • A. Jansen, D. Ellis, R. Saurous
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A learning framework for sound representation and recognition is presented that combines a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, a clustering objective that imposes categorical structure on the learned representations, and a cluster-based active learning procedure that solicits targeted weak supervision to consolidate clusters into relevant semantic classes.
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Computer Science, Physics
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
An open-source baseline separation model is introduced that can separate a variable number of sources in a mixture, based on an improved time-domain convolutional network (TDCN++), and scale-invariant signal-to-noise ratio improvement (SI-SNRi) is reported on mixtures with two to four sources.
Unsupervised Contrastive Learning of Sound Event Representations
This work proposes to use the pretext task of contrasting differently augmented views of sound events to suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.
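The pretext task of contrasting augmented views of the same clip is typically trained with a SimCLR-style normalized temperature-scaled cross-entropy. A minimal sketch under that assumption (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss over a batch of paired views:
    z1[i] and z2[i] are embeddings of two augmentations of clip i,
    and every other embedding in the batch acts as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)      # (2B, D)
    sim = z @ z.T / temperature               # cosine similarities
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    B = z1.shape[0]
    # The positive for row i is its paired view in the other half.
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * B), pos].mean()
```

For sound events the two views would be differently augmented spectrogram patches of the same clip (e.g. time shifts, mixing, or masking), pushed through a shared encoder.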
Contrastive Learning of General-Purpose Audio Representations
This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Universal Sound Separation
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
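The dB figures quoted here are improvements in scale-invariant signal-to-distortion ratio. The metric itself is standard and short enough to show: project the estimate onto the reference, then compare the energy of that target component to the residual (variable names below are ours).

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Orthogonal projection of the estimate onto the reference.
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))
```

The reported improvement (SI-SDRi) is the difference `si_sdr(est, ref) - si_sdr(mix, ref)`, i.e. how much the separated estimate gains over just listening to the unprocessed mixture.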
Look, Listen and Learn
There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. A novel "Audio-Visual Correspondence" learning task is introduced that makes use of this.
Unsupervised Feature Learning via Non-parametric Instance Discrimination
This work forms this intuition as a non-parametric classification problem at the instance-level, and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
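The instance-level classification step can be sketched as a non-parametric softmax over a memory bank holding one normalized embedding per training example. This is an illustrative simplification: the paper additionally uses noise-contrastive estimation precisely to avoid evaluating this full softmax over millions of instances.

```python
import numpy as np

def instance_discrimination_prob(v, memory, temperature=0.07):
    """Non-parametric instance-level softmax: probability that
    feature v belongs to each stored instance.

    v:      (D,)   query feature
    memory: (N, D) memory bank, one embedding per training example
    """
    v = v / np.linalg.norm(v)
    memory = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    logits = memory @ v / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()
```

Training maximizes the probability of the query's own instance, so every example becomes its own class without any labels.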