What’s all the Fuss about Free Universal Sound Separation Data?
@article{Wisdom2021WhatsAT,
  title={What’s all the Fuss about Free Universal Sound Separation Data?},
  author={Scott Wisdom and Hakan Erdogan and Daniel P. W. Ellis and Romain Serizel and Nicolas Turpault and Eduardo Fonseca and Justin Salamon and Prem Seetharaman and John R. Hershey},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={186-190}
}
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open…
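To make the construction concrete, here is a hypothetical sketch of how such mixtures can be assembled, not the actual FUSS pipeline: draw one to four dry single-source clips, apply a simulated room impulse response to each, and sum. `clips` and `rirs` are placeholder collections (equal-length dry clips and simulated box-room RIRs).

```python
# Hypothetical sketch of FUSS-style mixture creation; not the official tools.
import numpy as np

def make_mixture(rng, clips, rirs, max_sources=4):
    """Build one reverberant mixture of 1..max_sources sources.

    Assumes all clips share a common length.
    """
    n_sources = rng.integers(1, max_sources + 1)  # 1 to 4 sources per mixture
    sources = []
    for _ in range(n_sources):
        clip = clips[rng.integers(len(clips))]     # dry single-source audio
        rir = rirs[rng.integers(len(rirs))]        # simulated box-room RIR
        wet = np.convolve(clip, rir)[: len(clip)]  # apply reverberation
        sources.append(wet)
    mixture = np.sum(sources, axis=0)
    return mixture, sources  # mixture plus reverberant references for training
```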
29 Citations
Unsupervised Sound Separation Using Mixture Invariant Training
- Computer Science, NeurIPS
- 2020
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
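The MixIT objective summarized above lends itself to a compact sketch: remix the model's estimated sources back into the two reference mixtures and score the best assignment. A minimal NumPy version, assuming a hypothetical separation model has already produced `est_sources` from the mixture of mixtures `x1 + x2`:

```python
# Minimal MixIT loss sketch (NumPy); the separation model itself is assumed.
import itertools
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-invariant SNR in dB between a reference and an estimate."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def mixit_loss(x1, x2, est_sources):
    """Best negative SI-SNR over all assignments of estimated sources to the
    two reference mixtures (each source goes to exactly one mixture)."""
    M = est_sources.shape[0]
    best = np.inf
    # Enumerate all 2^M binary assignments of sources to mixture 1 or 2.
    for mask in itertools.product([False, True], repeat=M):
        mask = np.array(mask)
        remix1 = est_sources[mask].sum(axis=0)
        remix2 = est_sources[~mask].sum(axis=0)
        loss = -si_snr(x1, remix1) - si_snr(x2, remix2)
        best = min(best, loss)
    return best
```

Exhaustive enumeration is fine for the small source counts used here (e.g. M ≤ 8); larger M would call for a smarter assignment search.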
Learning to Separate Voices by Spatial Regions
- Computer Science
- 2022
A two-stage self-supervised framework in which voices overheard through earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model; the results underscore the importance of personalization over a generic supervised approach.
Text-Driven Separation of Arbitrary Sounds
- Computer Science, ArXiv
- 2022
This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, by combining two distinct models that are agnostic to the conditioning modality.
Compute and Memory Efficient Universal Sound Source Separation
- Computer Science, J. Signal Process. Syst.
- 2022
This study provides a family of efficient neural network architectures for general purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios.
Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data
- Computer Science, Proceedings of the AAAI Conference on Artificial Intelligence
- 2022
A three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset, AudioSet, which achieves Source-to-Distortion Ratio (SDR) performance comparable to current supervised models in both cases.
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
- Computer Science, ICLR
- 2021
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Few-shot learning of new sound classes for target sound extraction
- Physics, Interspeech
- 2021
This work proposes combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen audio event (AE) classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
- Computer Science, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
To combat over-separation in mixture invariant training, this paper introduces new losses: sparsity losses that favor fewer output sources, and a covariance loss that discourages correlated outputs.
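One plausible instantiation of the two auxiliary penalties described above (a hedged sketch; the paper's exact formulations may differ): an L1/L2 ratio on per-source energies that shrinks as energy concentrates in fewer outputs, and a mean absolute cross-correlation between distinct output pairs.

```python
# Sketch of sparsity and covariance penalties on separated outputs.
# est_sources: array of shape (M, T), M >= 2 estimated source waveforms.
import numpy as np

def sparsity_loss(est_sources, eps=1e-8):
    # L1-over-L2 penalty on per-source RMS energies: smaller when energy
    # concentrates in few sources, so minimizing it favors fewer outputs.
    rms = np.sqrt(np.mean(est_sources**2, axis=1) + eps)
    return np.sum(rms) / (np.sqrt(np.sum(rms**2)) + eps)

def covariance_loss(est_sources, eps=1e-8):
    # Mean absolute normalized correlation over off-diagonal source pairs:
    # penalizes the same sound leaking into several outputs.
    z = est_sources - est_sources.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(z, axis=1) + eps
    corr = (z @ z.T) / np.outer(norms, norms)
    m = est_sources.shape[0]
    return np.mean(np.abs(corr[~np.eye(m, dtype=bool)]))
```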
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
- Computer Science, ArXiv
- 2022
This paper introduces a target sound extraction (TSE) framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
References
Improving Universal Sound Separation Using Sound Classification
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
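One common way to realize such conditioning (a sketch only, not necessarily this paper's exact mechanism) is to broadcast the classifier's embedding across time and concatenate it to the separation network's frame features:

```python
# Sketch of embedding-based conditioning; shapes and names are illustrative.
import numpy as np

def condition_on_embedding(frame_features, class_embedding):
    # frame_features: (T, F) per-frame features of the mixture
    # class_embedding: (E,) semantic embedding from a sound classifier
    T = frame_features.shape[0]
    tiled = np.tile(class_embedding, (T, 1))            # (T, E)
    return np.concatenate([frame_features, tiled], axis=1)  # (T, F + E)
```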
WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation
- Computer Science, ArXiv
- 2019
A novel transformer-based model called the Spectro-Temporal Transformer (STT) that highlights temporal and spectral components of sources within a mixture using a self-attention mechanism, and subsequently disentangles them in a hierarchical manner.
Universal Sound Separation
- Computer Science, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2019
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Unsupervised Sound Separation Using Mixtures of Mixtures
- Computer Science, ArXiv
- 2020
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2020
This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.
Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work proposes a source separation framework trained with weakly labelled data that can separate 527 kinds of sound classes from AudioSet within a single system.
Two-Step Sound Source Separation: Training On Learned Latent Targets
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper proposes a two-step training procedure for deep-neural-network source separation that uses a scale-invariant signal-to-distortion ratio (SI-SDR) loss operating in a learned latent space, and proves that this latent loss lower-bounds the SI-SDR in the time domain.
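For context, the standard time-domain SI-SDR that the latent loss is proven to lower-bound is (textbook definition, not this paper's latent formulation):

```latex
\mathrm{SI\text{-}SDR}(s, \hat{s}) =
  10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2},
\qquad
\alpha = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}
```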
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Listening to Each Speaker One by One with Recurrent Selective Hearing Networks
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper casts source separation as a recursive, multi-pass source extraction problem, using a recurrent neural network (RNN) that learns to determine how many extraction iterations to perform depending on the input signal.
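The recursive extraction idea reduces to a simple loop: extract one source, subtract it from the residual, and stop when little energy remains. A hedged sketch, where `extract_one` stands in for the (hypothetical) learned single-source extractor:

```python
# Sketch of recursive multi-pass extraction; stopping rule is illustrative.
import numpy as np

def recursive_extract(mixture, extract_one, max_iters=4, energy_frac=1e-3):
    residual, sources = mixture.copy(), []
    for _ in range(max_iters):
        s = extract_one(residual)   # hypothetical one-source extractor
        sources.append(s)
        residual = residual - s
        # Stop once the residual energy is a tiny fraction of the mixture's.
        if np.mean(residual**2) < energy_frac * np.mean(mixture**2):
            break
    return sources
```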
Scaper: A library for soundscape synthesis and augmentation
- Computer Science, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2017
Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined “specification,” increasing the variability of the output.
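To give a sense of Scaper's specification-driven API, here is a small usage example in the style of the library's tutorial; paths and labels are placeholders, and parameter distributions are illustrative:

```python
import scaper

# A 10-second soundscape drawing foreground/background clips from two folders.
sc = scaper.Scaper(duration=10.0, fg_path='foreground/', bg_path='background/')
sc.ref_db = -50  # reference loudness for the background

# Background: any available label/file, starting at the beginning.
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# Foreground event: placement, duration, SNR, and augmentations are all
# specified as probability distributions, sampled anew per soundscape.
sc.add_event(label=('choose', []),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 9),
             event_duration=('truncnorm', 3, 1, 0.5, 5),
             snr=('normal', 10, 3),
             pitch_shift=('uniform', -2, 2),
             time_stretch=('uniform', 0.8, 1.2))

# Each call samples a fresh soundscape plus a JAMS annotation file.
sc.generate('soundscape.wav', 'soundscape.jams')
```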