Universal Sound Separation

@article{Kavalerov2019UniversalSS,
  title={Universal Sound Separation},
  author={Ilya Kavalerov and Scott Wisdom and Hakan Erdogan and Brian Patton and Kevin W. Wilson and Jonathan Le Roux and John R. Hershey},
  journal={2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2019},
  pages={175-179}
}
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based… 
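As a rough illustration of the mask-based separation paradigm the abstract refers to, the sketch below applies oracle magnitude-ratio masks to a mixture's STFT. The masks stand in for what a trained network would predict; all names and parameters are illustrative only.
```python
# Illustrative mask-based separation with oracle magnitude-ratio masks.
# A real system would predict the masks with a neural network; here they
# are computed from the clean sources, which is only possible when building
# training targets or doing analysis.
import numpy as np
from scipy.signal import stft, istft

def oracle_ratio_masks(sources, nperseg=512):
    """One magnitude-ratio mask per source, computed from the sources' STFTs."""
    mags = np.stack([np.abs(stft(s, nperseg=nperseg)[2]) for s in sources])
    return mags / (mags.sum(axis=0) + 1e-8)          # masks sum to ~1 per bin

def separate_with_masks(mixture, masks, nperseg=512):
    """Apply each mask to the mixture STFT and invert back to the time domain."""
    _, _, X = stft(mixture, nperseg=nperseg)
    return [istft(m * X, nperseg=nperseg)[1] for m in masks]

# Toy example: a tone plus noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
src_a = np.sin(2 * np.pi * 440 * t)
src_b = 0.5 * rng.standard_normal(t.size)
mix = src_a + src_b
estimates = separate_with_masks(mix, oracle_ratio_masks([src_a, src_b]))
```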

Citations

Improving Universal Sound Separation Using Sound Classification
TLDR
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state of the art for universal sound separation.
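One plausible way to condition a separation network on a classifier embedding, in the spirit of the approach above, is a feature-wise affine (FiLM-style) modulation. The mechanism and names below are assumptions for illustration, not the paper's exact conditioning scheme.
```python
# Hypothetical FiLM-style conditioning of separator features on a classifier
# embedding: per-channel scale and shift are predicted from the embedding.
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    def __init__(self, emb_dim, num_channels):
        super().__init__()
        self.scale = nn.Linear(emb_dim, num_channels)
        self.shift = nn.Linear(emb_dim, num_channels)

    def forward(self, feats, embedding):
        # feats: (batch, channels, frames); embedding: (batch, emb_dim)
        g = self.scale(embedding).unsqueeze(-1)
        b = self.shift(embedding).unsqueeze(-1)
        return g * feats + b

film = FiLMConditioning(emb_dim=128, num_channels=256)
conditioned = film(torch.randn(4, 256, 100), torch.randn(4, 128))
```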
Compute and Memory Efficient Universal Sound Source Separation
TLDR
This study provides a family of efficient neural network architectures for general-purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios.
Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation
TLDR
A time-domain mapping-based algorithm that directly estimates clean speech features in an end-to-end system and makes use of an optimal scale-invariant signal-to-distortion ratio (OSI-SDR) loss function.
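For reference, the standard scale-invariant SDR that losses such as the OSI-SDR above build on can be computed as follows; this is the common SI-SDR definition, not the paper's optimal variant.
```python
# Standard SI-SDR (in dB) between a 1-D estimate and a reference signal.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Training typically minimises the negative SI-SDR averaged over a batch.
```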
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Computer Science, Physics
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced; it is based on an improved time-domain convolutional network (TDCN++) and reports scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
SADDEL: Joint Speech Separation and Denoising Model based on Multitask Learning
TLDR
A joint speech separation and denoising framework based on a multitask learning criterion is proposed to tackle the two issues simultaneously, and the experimental results show that it not only performs well on both speech separation and denoising tasks but also outperforms related methods in most conditions.
Unsupervised Sound Separation Using Mixture Invariant Training
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
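A minimal sketch of the MixIT objective: the model is fed a mixture of two reference mixtures, and the loss searches over binary assignments of its source estimates to those two references. The brute-force enumeration and the `loss_fn` argument (e.g., negative SI-SNR) are illustrative simplifications.
```python
# Brute-force MixIT loss: assign each estimated source to one of the two
# reference mixtures and keep the assignment with the lowest total loss.
import itertools
import numpy as np

def mixit_loss(est_sources, ref_mixtures, loss_fn):
    """est_sources: (M, T) estimates; ref_mixtures: (2, T) reference mixtures."""
    best = np.inf
    for assign in itertools.product([0, 1], repeat=est_sources.shape[0]):
        groups = [np.zeros_like(ref_mixtures[0]), np.zeros_like(ref_mixtures[1])]
        for src, g in zip(est_sources, assign):
            groups[g] = groups[g] + src
        loss = loss_fn(groups[0], ref_mixtures[0]) + loss_fn(groups[1], ref_mixtures[1])
        best = min(best, loss)
    return best
```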
Two-Step Sound Source Separation: Training On Learned Latent Targets
TLDR
This paper proposes a two-step training procedure for source separation via a deep neural network that makes use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function operating in the latent space, and proves that it lower-bounds the SI-SDR in the time domain.
One-Shot Conditional Audio Filtering of Arbitrary Sounds
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a…
Sudo rm -rf: Efficient Networks for Universal Audio Source Separation
TLDR
The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), together with their aggregation, which is performed through simple one-dimensional convolutions.
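A simplified stand-in for the idea named above, not the exact SuDoRM-RF block: features are successively downsampled with strided 1-D convolutions, then resampled back to the input resolution and aggregated by summation.
```python
# Simplified multi-resolution block: successive strided 1-D convolutions
# downsample the features, which are then resampled back and summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionBlock(nn.Module):
    def __init__(self, channels=128, depth=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2)
            for _ in range(depth)
        )

    def forward(self, x):                         # x: (batch, channels, frames)
        feats, h = [], x
        for conv in self.down:                    # successive downsampling
            h = F.relu(conv(h))
            feats.append(h)
        out = x
        for f in feats:                           # resample back and aggregate
            out = out + F.interpolate(f, size=x.size(-1), mode="nearest")
        return out

block = MultiResolutionBlock()
y = block(torch.randn(2, 128, 1600))
```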
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
TLDR
This paper investigates using MixIT to adapt a separation model on real far-field, overlapping, reverberant, and noisy speech data from the AMI Corpus, and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvements, PESQ scores, and human listening ratings across synthetic and real datasets.
...

References

Showing 1-10 of 31 references
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
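A skeletal sketch of the learned encoder/mask/decoder structure used by time-domain networks such as Conv-TasNet. The separator here is a trivial stand-in for the deep temporal convolutional network the actual model uses, and all sizes are illustrative.
```python
# Encoder / mask / decoder skeleton of a time-domain separation network.
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, num_sources=2, num_filters=256, kernel=16, stride=8):
        super().__init__()
        self.num_sources = num_sources
        self.encoder = nn.Conv1d(1, num_filters, kernel, stride=stride, bias=False)
        self.separator = nn.Sequential(           # stand-in for the deep TCN
            nn.Conv1d(num_filters, num_filters, 1), nn.ReLU(),
            nn.Conv1d(num_filters, num_sources * num_filters, 1),
        )
        self.decoder = nn.ConvTranspose1d(num_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, wav):                        # wav: (batch, 1, time)
        feats = torch.relu(self.encoder(wav))      # learned-basis analysis
        masks = torch.sigmoid(self.separator(feats))
        masks = masks.view(wav.size(0), self.num_sources, -1, feats.size(-1))
        outs = [self.decoder(m * feats) for m in masks.unbind(dim=1)]
        return torch.cat(outs, dim=1)              # (batch, sources, time)

model = TinyTasNet()
estimates = model(torch.randn(4, 1, 16000))        # four 1-second mixtures at 16 kHz
```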
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks
TLDR
A phase-sensitive objective function based on the signal-to-noise ratio (SNR) of the reconstructed signal is developed, and experiments show that it yields uniformly better results in terms of signal-to-distortion ratio (SDR).
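The phase-sensitive objective compares the masked mixture magnitude against the source magnitude scaled by the cosine of the source-mixture phase difference; the sketch below follows that standard formulation, with shapes chosen for illustration.
```python
# Phase-sensitive approximation loss on STFTs: the masked mixture magnitude
# is matched to the source magnitude scaled by cos(source phase - mixture phase).
import numpy as np

def phase_sensitive_loss(mask, mix_stft, src_stft):
    """mask: real-valued (F, T); mix_stft, src_stft: complex (F, T)."""
    phase_diff = np.angle(src_stft) - np.angle(mix_stft)
    target = np.abs(src_stft) * np.cos(phase_diff)   # phase-sensitive target
    estimate = mask * np.abs(mix_stft)
    return np.mean((estimate - target) ** 2)
```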
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
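Permutation invariant training in its simplest form evaluates the separation loss under every pairing of estimates with references and keeps the best; `loss_fn` is assumed to be any per-source loss such as negative SI-SNR.
```python
# PIT: evaluate the loss under every estimate-to-reference permutation and
# keep the minimum, making training independent of output ordering.
import itertools
import numpy as np

def pit_loss(est_sources, ref_sources, loss_fn):
    """est_sources, ref_sources: arrays of shape (num_sources, T)."""
    best = np.inf
    for perm in itertools.permutations(range(len(ref_sources))):
        loss = np.mean([loss_fn(est_sources[i], ref_sources[p])
                        for i, p in enumerate(perm)])
        best = min(best, loss)
    return best
```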
MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation
TLDR
A novel architecture that integrates long short-term memory (LSTM) at multiple scales with skip connections to efficiently model long-term structures within an audio context is proposed, and it yields better results than those obtained using ideal binary masks for a singing voice separation task.
Deep clustering and conventional networks for music separation: Stronger together
TLDR
It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
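For context, the standard deep clustering objective asks the pairwise affinities of learned time-frequency embeddings to match those of ideal one-hot source assignments; the expanded low-rank form below avoids materializing the bins-by-bins affinity matrix. This is the generic formulation, not anything specific to the singing-voice setup above.
```python
# Deep clustering objective ||V V^T - Y Y^T||_F^2 in its expanded form,
# which never forms the (num_bins x num_bins) affinity matrices explicitly.
import numpy as np

def deep_clustering_loss(V, Y):
    """V: (num_bins, emb_dim) unit-norm embeddings; Y: (num_bins, num_sources) one-hot."""
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```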
Monoaural Audio Source Separation Using Deep Convolutional Neural Networks
TLDR
A low-latency monaural source separation framework using a convolutional neural network is presented, and its performance is evaluated on a database comprising musical mixtures of three instruments as well as other instruments that vary from song to song.
Speaker-Independent Speech Separation With Deep Attractor Network
TLDR
This work proposes a novel deep learning framework for speech separation that uses a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space, presents three methods for finding the attractors for each source in that space, and compares their advantages and limitations.
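A sketch of the basic training-time attractor computation in a deep attractor network: each source's attractor is the assignment-weighted mean of the embeddings, and masks come from a softmax over embedding-attractor similarities. The paper's three attractor-estimation methods are not reproduced here.
```python
# Training-time attractors as assignment-weighted means of T-F embeddings;
# masks follow from a softmax over embedding-attractor similarities.
import numpy as np

def attractor_masks(V, Y, eps=1e-8):
    """V: (num_bins, emb_dim) embeddings; Y: (num_bins, num_sources) ideal assignments."""
    attractors = (Y.T @ V) / (Y.sum(axis=0)[:, None] + eps)   # (sources, emb_dim)
    scores = V @ attractors.T                                  # (num_bins, sources)
    scores -= scores.max(axis=1, keepdims=True)                # numerically stable softmax
    masks = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return masks, attractors
```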
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
  • Scott Wisdom, J. Hershey, R. Saurous
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This paper presents a new approach to masking that applies mixture consistency to complex-valued short-time Fourier transforms (STFTs) using real-valued masks, and shows that this approach can be effective in speech enhancement.
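The simplest uniform-weight form of a mixture consistency projection redistributes the residual between the mixture and the sum of the estimates equally across sources, so the corrected estimates sum exactly to the mixture. The paper applies such constraints to STFT estimates; a time-domain version is shown here only for brevity.
```python
# Uniform mixture consistency projection: spread the residual between the
# mixture and the summed estimates equally across the estimated sources.
import numpy as np

def mixture_consistency(estimates, mixture):
    """estimates: (num_sources, T); mixture: (T,)."""
    residual = mixture - estimates.sum(axis=0)
    return estimates + residual / estimates.shape[0]

# After projection, the corrected estimates sum exactly to the mixture.
```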
Supervised Speech Separation Based on Deep Learning: An Overview
TLDR
This paper provides a comprehensive overview of the research on deep-learning-based supervised speech separation in the last several years, along with a historical perspective on how advances have been made.
...