What’s all the Fuss about Free Universal Sound Separation Data?

@article{Wisdom2021WhatsAT,
  title={What’s all the Fuss about Free Universal Sound Separation Data?},
  author={Scott Wisdom and Hakan Erdogan and Daniel P. W. Ellis and Romain Serizel and Nicolas Turpault and Eduardo Fonseca and Justin Salamon and Prem Seetharaman and John R. Hershey},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={186-190}
}
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Published 2 November 2020
  • Computer Science, Physics
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open… 
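
To make the mixing recipe concrete, here is a minimal Python sketch of the kind of procedure the abstract describes: draw one to four single-source clips, convolve each with a room impulse response, and sum. This is an illustrative assumption, not the released FUSS pipeline (which uses Scaper and a proper image-method room simulator); the toy exponential-decay impulse response and all names below are placeholders.

import numpy as np

rng = np.random.default_rng(0)
sr = 16000
clip_len = 10 * sr                           # 10-second clips at 16 kHz

def toy_rir(rt60=0.3, length=4000):
    # Crude exponentially decaying noise as a stand-in for a simulated
    # box-room impulse response (the paper uses a real room simulator).
    t = np.arange(length) / sr
    return rng.standard_normal(length) * np.exp(-6.9 * t / rt60)

def make_mixture(dry_sources):
    # Convolve each dry source with its own impulse response and sum.
    mix = np.zeros(clip_len)
    reverberant = []
    for s in dry_sources:
        wet = np.convolve(s, toy_rir())[:clip_len]
        reverberant.append(wet)
        mix += wet
    return mix, reverberant                  # mixture + reference sources

# Stand-in "single-source clips"; the real dataset draws them from
# 357 classes of freely licensed audio.
n_src = rng.integers(1, 5)                   # one to four sources
dry = [0.05 * rng.standard_normal(clip_len) for _ in range(n_src)]
mixture, references = make_mixture(dry)
print(n_src, mixture.shape)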

Citations

Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
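
A rough sketch of the idea summarized above, not the authors' implementation: two reference mixtures are summed, a model separates the sum into several sources, and the loss takes the best assignment of estimated sources back to the two mixtures. The simple MSE loss and the brute-force assignment search are simplifying assumptions; the paper uses a negative-SNR loss.

import itertools
import numpy as np

def mse(ref, est):
    # Simple squared-error loss; the paper uses a negative-SNR loss instead.
    return np.mean((ref - est) ** 2)

def mixit_loss(mix1, mix2, est_sources):
    # Best-case loss over all ways of assigning each estimated source
    # to exactly one of the two reference mixtures.
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=len(est_sources)):
        groups = [np.zeros_like(mix1), np.zeros_like(mix2)]
        for src, g in zip(est_sources, assignment):
            groups[g] = groups[g] + src
        best = min(best, mse(mix1, groups[0]) + mse(mix2, groups[1]))
    return best

# Toy usage: in training, est_sources would be the model's outputs for the
# mixture of mixtures mix1 + mix2.
rng = np.random.default_rng(0)
mix1, mix2 = rng.standard_normal(16000), rng.standard_normal(16000)
est_sources = [rng.standard_normal(16000) for _ in range(4)]
print(mixit_loss(mix1, mix2, est_sources))
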
Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data
TLDR
A three-component pipeline is proposed to train a universal audio source separator from a large but weakly-labeled dataset, AudioSet, and it achieves Source-to-Distortion Ratio (SDR) performance comparable to current supervised models in both cases.
Few-shot learning of new sound classes for target sound extraction
TLDR
This work proposes combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen acoustic event (AE) classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
TLDR
This paper introduces new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs to combat over-separation in mixture invariant training.
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
TLDR
This paper introduces a target sound extraction (TSE) framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
Separate What You Describe: Language-Queried Audio Source Separation
TLDR
This paper proposes LASS-Net, an end-to-end neural network that is learned to jointly process acoustic and linguistic information, and separate the target source that is consistent with the language query from an audio mixture.
SA-SDR: A novel loss function for separation of meeting style data
TLDR
This work proposes to switch from a mean over the SDRs of each individual output channel to a global SDR over all output channels at the same time, which it calls the source-aggregated SDR (SA-SDR); this makes the loss robust against silence and perfect reconstruction, as long as at least one reference signal is not silent.
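
A short sketch of the aggregation described above, under the assumption that signal energies and error energies are pooled over all channels before the logarithm (generic NumPy, not the authors' code):

import numpy as np

def sa_sdr(refs, ests, eps=1e-8):
    # refs, ests: arrays of shape (num_channels, num_samples).
    # Energies and errors are pooled over all channels before the log.
    num = np.sum(refs ** 2)
    den = np.sum((refs - ests) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

def mean_per_channel_sdr(refs, ests, eps=1e-8):
    # The conventional alternative: average the per-channel SDRs, which
    # degenerates when a reference channel is silent.
    num = np.sum(refs ** 2, axis=-1)
    den = np.sum((refs - ests) ** 2, axis=-1)
    return np.mean(10.0 * np.log10((num + eps) / (den + eps)))
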
Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement
TLDR
A novel explanation from the perspective of the low-distortion nature of such algorithms is provided, and it is found that they can consistently improve phase estimation.
On the Compensation Between Magnitude and Phase in Speech Separation
TLDR
A novel view is provided from the perspective of the implicit compensation between the estimated magnitude and phase that arises in deep-neural-network-based end-to-end optimization in the complex time-frequency (T-F) domain or the time domain.

References

SHOWING 1-10 OF 31 REFERENCES
WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation
TLDR
A novel transformer-based model called the Spectro-Temporal Transformer (STT) is proposed, which highlights temporal and spectral components of sources within a mixture using a self-attention mechanism and subsequently disentangles them in a hierarchical manner.
Universal Sound Separation
TLDR
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
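
For reference, a common definition of the scale-invariant signal-to-distortion ratio mentioned above; this is a generic sketch rather than code from the cited paper.

import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    # Scale-invariant SDR in dB: the estimate's projection onto the
    # reference is treated as the target component, the rest as noise.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) /
                           (np.sum(noise ** 2) + eps))
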
Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis
TLDR
This work proposes a source separation framework trained with weakly labelled data that can separate 527 kinds of sound classes from AudioSet within a single system.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Listening to Each Speaker One by One with Recurrent Selective Hearing Networks
TLDR
This paper casts the source separation problem as a recursive multi-pass source extraction problem based on a recurrent neural network (RNN) that can learn and determine how many computational steps/iterations have to be performed depending on the input signals.
Scaper: A library for soundscape synthesis and augmentation
TLDR
Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined “specification”, to increase the variability of the output.
SDR – Half-baked or Well Done?
TLDR
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
  • Scott Wisdom, J. Hershey, R. Saurous
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This paper presents a new approach to masking that applies mixture consistency to complex-valued short-time Fourier transforms (STFTs) using real-valued masks, and shows that this approach can be effective in speech enhancement.
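
An illustrative time-domain sketch of a mixture-consistency projection in the spirit of this entry; the paper applies the constraint to complex STFT estimates and weighted variants exist, so the equal redistribution of the residual below is an assumed simplification.

import numpy as np

def mixture_consistent(mixture, est_sources):
    # Adjust the source estimates so they sum exactly to the observed
    # mixture, spreading the residual equally across sources.
    est = np.asarray(est_sources)            # shape: (num_sources, num_samples)
    residual = mixture - est.sum(axis=0)
    return est + residual / est.shape[0]
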
TUT database for acoustic scene classification and sound event detection
TLDR
The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and a sound event detection baseline system using mel-frequency cepstral coefficients and Gaussian mixture models are presented.