What’s all the Fuss about Free Universal Sound Separation Data?

@article{Wisdom2021WhatsAT,
  title={What’s all the Fuss about Free Universal Sound Separation Data?},
  author={Scott Wisdom and Hakan Erdogan and Daniel P. W. Ellis and Romain Serizel and Nicolas Turpault and Eduardo Fonseca and Justin Salamon and Prem Seetharaman and John R. Hershey},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={186-190}
}
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Published 2 November 2020
  • Computer Science, Physics
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open… 
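As a rough illustration of the mixing recipe described in the abstract (one to four sources per mixture, each convolved with a simulated room impulse response before summing), here is a minimal sketch using stand-in arrays. The actual pipeline uses a dedicated acoustic room simulator and soundscape-generation tooling; this toy code only shows the shape of the procedure.

```python
import numpy as np

def make_mixture(sources, rirs, rng):
    """Sketch of FUSS-style mixing: pick 1-4 sources, convolve each with a
    room impulse response, and sum them into a single-channel mixture.

    sources: list of 1-D numpy arrays (dry single-source clips)
    rirs:    list of 1-D numpy arrays (simulated room impulse responses)
    """
    n_src = rng.integers(1, 5)                       # 1 to 4 sources per mixture
    chosen = rng.choice(len(sources), size=n_src, replace=False)
    reverberant = []
    for idx in chosen:
        rir = rirs[rng.integers(len(rirs))]
        reverberant.append(np.convolve(sources[idx], rir))   # apply reverberation
    length = max(len(s) for s in reverberant)
    mixture = np.zeros(length)
    references = []
    for wet in reverberant:
        padded = np.pad(wet, (0, length - len(wet)))
        references.append(padded)
        mixture += padded                            # sources sum to the mixture
    return mixture, references

# Example with random data standing in for real audio clips and RIRs.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(10)]
rirs = [rng.standard_normal(2000) for _ in range(5)]
mix, refs = make_mixture(sources, rirs, rng)
```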

Citations

Unsupervised Sound Separation Using Mixture Invariant Training
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
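The MixIT objective summarized above can be made concrete with a short sketch: the separator sees the sum of two mixtures, and the loss searches over binary assignments of its output sources back to those two mixtures. The brute-force search and negative-SNR loss below are illustrative choices, not the authors' implementation.

```python
import itertools
import torch

def neg_snr(ref, est, eps=1e-8):
    """Negative signal-to-noise ratio between a reference and an estimate."""
    err = ref - est
    return -10 * torch.log10((ref.pow(2).sum() + eps) / (err.pow(2).sum() + eps))

def mixit_loss(separated, mix1, mix2):
    """Brute-force MixIT loss: assign each separated source to one of the two
    reference mixtures and keep the assignment with the lowest total loss.

    separated: (M, T) tensor of estimated sources from the separator
    mix1, mix2: (T,) reference mixtures whose sum was the separator's input
    """
    m = separated.shape[0]
    best = None
    for bits in itertools.product([0, 1], repeat=m):          # all 2^M assignments
        a = torch.tensor(bits, dtype=separated.dtype)
        est1 = (a.unsqueeze(1) * separated).sum(dim=0)        # sources assigned to mix1
        est2 = ((1 - a).unsqueeze(1) * separated).sum(dim=0)  # remainder goes to mix2
        loss = neg_snr(mix1, est1) + neg_snr(mix2, est2)
        best = loss if best is None or loss < best else best
    return best

# Toy usage: the separator's input would be mix1 + mix2.
torch.manual_seed(0)
mix1, mix2 = torch.randn(16000), torch.randn(16000)
separated = torch.randn(4, 16000)          # stand-in for separator outputs
loss = mixit_loss(separated, mix1, mix2)
```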
Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Learning to Separate Voices by Spatial Regions
TLDR
A two-stage self-supervised framework in which overheard voices from earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model, underscoring the importance of personalization over a generic supervised approach.
Text-Driven Separation of Arbitrary Sounds
TLDR
This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, by combining two distinct models that are agnostic to the conditioning modality.
Compute and Memory Efficient Universal Sound Source Separation
TLDR
This study provides a family of efficient neural network architectures for general purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios.
Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data
TLDR
A three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet, which achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
TLDR
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Few-shot learning of new sound classes for target sound extraction
TLDR
This work proposes combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
TLDR
This paper introduces new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs to combat over-separation in mixture invariant training.
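As a loose illustration of the covariance idea only (not the exact sparsity or covariance losses introduced in the paper), a generic penalty that grows when outputs are correlated might look like this:

```python
import torch

def covariance_penalty(separated, eps=1e-8):
    """Generic penalty on correlated outputs: mean squared off-diagonal entry
    of the normalized covariance matrix of the separated sources, so fully
    decorrelated outputs give a value near zero.

    separated: (M, T) tensor of estimated sources.
    """
    x = separated - separated.mean(dim=1, keepdim=True)
    x = x / (x.norm(dim=1, keepdim=True) + eps)      # unit-norm rows
    cov = x @ x.t()                                  # (M, M) correlation-like matrix
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return off_diag.pow(2).mean()

# Toy check: identical outputs are heavily penalized, random ones much less.
torch.manual_seed(0)
print(covariance_penalty(torch.randn(1, 16000).repeat(4, 1)))  # high: fully correlated
print(covariance_penalty(torch.randn(4, 16000)))               # low: nearly uncorrelated
```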
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
TLDR
This paper introduces a TSE framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
...
...

References

SHOWING 1-10 OF 31 REFERENCES
Improving Universal Sound Separation Using Sound Classification
TLDR
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
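One common way to condition a separator on such classifier embeddings is feature-wise scaling and shifting of intermediate activations; the layer below is a generic sketch under that assumption, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class EmbeddingConditioning(nn.Module):
    """Feature-wise conditioning: a semantic embedding predicts a per-channel
    scale and shift applied to the separator's intermediate features."""

    def __init__(self, embed_dim, num_channels):
        super().__init__()
        self.to_scale = nn.Linear(embed_dim, num_channels)
        self.to_shift = nn.Linear(embed_dim, num_channels)

    def forward(self, features, embedding):
        # features: (batch, channels, time), embedding: (batch, embed_dim)
        scale = self.to_scale(embedding).unsqueeze(-1)   # (batch, channels, 1)
        shift = self.to_shift(embedding).unsqueeze(-1)
        return features * (1 + scale) + shift

# Toy usage with a hypothetical 128-dim classifier embedding.
layer = EmbeddingConditioning(embed_dim=128, num_channels=256)
out = layer(torch.randn(2, 256, 500), torch.randn(2, 128))   # same shape as the features
```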
WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation
TLDR
A novel transformer-based model called the Spectro-Temporal Transformer (STT) highlights temporal and spectral components of sources within a mixture using a self-attention mechanism, and subsequently disentangles them in a hierarchical manner.
Universal Sound Separation
TLDR
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision
TLDR
This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.
Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis
TLDR
This work proposes a source separation framework trained with weakly labelled data that can separate 527 kinds of sound classes from AudioSet within a single system.
Two-Step Sound Source Separation: Training On Learned Latent Targets
TLDR
This paper proposes a two-step training procedure for source separation via a deep neural network that makes use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and proves that it lower-bounds the SI-SDR in the time domain.
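For reference, SI-SDR, the metric and loss mentioned here and in several of the papers above, rescales the reference signal by its optimal gain before measuring distortion. A minimal implementation:

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    The reference is scaled by the gain that best matches the estimate, so
    the metric ignores overall level differences.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                 # optimally scaled reference
    noise = estimate - target                  # everything not explained by the target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# A rescaled copy of the reference scores very high; a noisy estimate scores lower.
ref = np.random.default_rng(0).standard_normal(16000)
print(si_sdr(ref, 0.5 * ref))
print(si_sdr(ref, ref + np.random.default_rng(1).standard_normal(16000)))
```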
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
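The encoder/mask/decoder structure described in this summary can be sketched schematically; the toy mask estimator below stands in for Conv-TasNet's deep temporal convolutional network and is not the published architecture.

```python
import torch
import torch.nn as nn

class TinyTimeDomainSeparator(nn.Module):
    """Schematic encoder/mask/decoder separator in the spirit of Conv-TasNet.
    The mask estimator is a one-layer stand-in for the real separator module."""

    def __init__(self, num_sources=2, num_filters=256, kernel_size=16, stride=8):
        super().__init__()
        self.num_sources = num_sources
        # Learned analysis transform replacing the STFT.
        self.encoder = nn.Conv1d(1, num_filters, kernel_size, stride=stride, bias=False)
        # Toy mask estimator: one conv layer producing one mask per source.
        self.mask_net = nn.Sequential(
            nn.Conv1d(num_filters, num_filters * num_sources, 1),
            nn.Sigmoid(),
        )
        # Learned synthesis transform back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(num_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):
        # mixture: (batch, 1, time)
        feats = torch.relu(self.encoder(mixture))                # (B, F, frames)
        masks = self.mask_net(feats)                             # (B, S*F, frames)
        b, _, frames = masks.shape
        masks = masks.view(b, self.num_sources, -1, frames)      # (B, S, F, frames)
        separated = [self.decoder(masks[:, s] * feats)           # mask, then resynthesize
                     for s in range(self.num_sources)]
        return torch.cat(separated, dim=1)                       # (B, S, time)

model = TinyTimeDomainSeparator()
est = model(torch.randn(2, 1, 16000))    # two estimated sources per input mixture
```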
Listening to Each Speaker One by One with Recurrent Selective Hearing Networks
TLDR
This paper casts the source separation problem as a recursive multi-pass source extraction problem based on a recurrent neural network (RNN) that can learn and determine how many computational steps/iterations have to be performed depending on the input signals.
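The recursive extract-and-subtract control flow can be sketched as a simple loop that peels off one source at a time and stops when little residual energy remains; the placeholder extractor below stands in for the paper's recurrent selective-hearing network.

```python
import numpy as np

def recursive_extraction(mixture, extract_one, max_steps=4, energy_floor=1e-3):
    """Sketch of recursive one-at-a-time source extraction.

    extract_one: callable that, given the current residual, returns an estimate
    of a single source (a recurrent network in the paper, any placeholder here).
    """
    residual = mixture.copy()
    sources = []
    for _ in range(max_steps):
        if np.mean(residual ** 2) < energy_floor:    # nothing left worth extracting
            break
        estimate = extract_one(residual)
        sources.append(estimate)
        residual = residual - estimate               # peel the extracted source off
    return sources, residual

# Placeholder extractor that simply claims half of the remaining residual.
mix = np.random.default_rng(0).standard_normal(16000)
srcs, res = recursive_extraction(mix, lambda r: 0.5 * r)
```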
Scaper: A library for soundscape synthesis and augmentation
TLDR
Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined “specification”, to increase the variability of the output.
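Scaper is the tool the FUSS paper builds on for mixture creation; the snippet below follows the basic usage pattern from Scaper's documentation, with illustrative folder paths, event counts, and distribution choices rather than the FUSS recipe itself.

```python
import scaper

# Illustrative layout: each subfolder of 'foreground/' holds clips for one label.
FG_PATH, BG_PATH = 'foreground', 'background'

for i in range(5):
    sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH)
    sc.ref_db = -30
    sc.sr = 16000

    # Probabilistic specification: each event's label, timing, and SNR are
    # sampled from the distributions declared below.
    for _ in range(3):
        sc.add_event(label=('choose', []),              # any available label
                     source_file=('choose', []),        # any clip of that label
                     source_time=('const', 0),
                     event_time=('uniform', 0, 8),
                     event_duration=('uniform', 1, 4),
                     snr=('uniform', -5, 5),
                     pitch_shift=None,
                     time_stretch=None)

    # Each call draws a new soundscape (audio plus JAMS annotation) from the
    # same specification.
    sc.generate(f'soundscape_{i}.wav', f'soundscape_{i}.jams')
```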
...
...