Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

  • Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey
  • Published 1 June 2021
  • Computer Science
  • 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are… 
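For orientation, the MixIT objective that the abstract builds on can be sketched as follows. This is a minimal illustration, not the authors' implementation. The model separates the sum of two mixtures into several estimated sources, and the loss is the smallest error over all ways of assigning each source to one of the two mixtures. The function names and toy signals are hypothetical, and mean squared error is assumed here in place of the SNR-based loss used in the MixIT paper.

```python
import itertools


def mse(target, estimate):
    """Mean squared error between two equal-length signals."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def mixit_loss(mix1, mix2, est_sources):
    """Best remixing error over all binary source-to-mixture assignments.

    est_sources are the (hypothetical) model outputs for the input mix1 + mix2.
    Returns (best_loss, best_assignment), where best_assignment[m] is 0 if
    source m is assigned to mix1 and 1 if it is assigned to mix2.
    """
    n = len(mix1)
    best_loss, best_assign = float("inf"), None
    for assign in itertools.product((0, 1), repeat=len(est_sources)):
        # Remix the estimated sources according to this assignment.
        remix = [[0.0] * n, [0.0] * n]
        for src, k in zip(est_sources, assign):
            for t in range(n):
                remix[k][t] += src[t]
        loss = mse(mix1, remix[0]) + mse(mix2, remix[1])
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy example with made-up numbers: sources 0 and 2 sum to mix1, source 1 is mix2.
mix1 = [1.0, 2.0, 3.0]
mix2 = [0.5, -0.5, 0.0]
est = [[1.0, 0.0, 3.0], [0.5, -0.5, 0.0], [0.0, 2.0, 0.0]]
loss, assign = mixit_loss(mix1, mix2, est)
```

In this toy case mix1 is reconstructed exactly even though its content is split across two outputs. The objective does not penalize such a split, which is consistent with the over-separation tendency the abstract describes. The search covers 2^M assignments for M output sources, so its cost grows quickly with M.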

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation
A novel separation consistency training method, termed SCT, exploits real-world unlabeled mixtures to iteratively improve cross-domain unsupervised speech separation by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models.
Unsupervised Source Separation via Self-Supervised Training
We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
This paper investigates using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets.
Unsupervised Audio Source Separation Using Differentiable Parametric Source Models
This work makes powerful deep-learning-based separation usable in scenarios where training data with ground truth is expensive or nonexistent; integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency.
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction
A densely-connected pyramid complex convolutional network, termed DPCCN, is proposed to improve the robustness of speech separation under complicated conditions and is generalized to target speech extraction (TSE) by integrating a new specially designed speaker encoder.


Unsupervised Sound Separation Using Mixture Invariant Training
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Deep clustering: Discriminative embeddings for segmentation and separation
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.
Improving Universal Sound Separation Using Sound Classification
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Computer Science, Physics
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), that achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
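The SI-SNRi metric named in this summary can be computed roughly as below. This is a sketch of the standard scale-invariant SNR definition, not code from the cited work. The function name, the `eps` stabilizer, and the toy signals are assumptions made for illustration.

```python
import math


def si_snr(reference, estimate, eps=1e-12):
    """Scale-invariant SNR in dB of an estimate with respect to a reference."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Rescale the reference so the estimate's overall gain does not matter.
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Made-up signals: a reference source, the input mixture, and a separated estimate.
reference = [1.0, 0.0, -1.0, 0.0]
mixture = [1.5, 0.5, -0.5, 0.5]
estimate = [0.9, 0.1, -1.1, 0.0]

# SI-SNRi: how much the separated estimate improves on using the mixture itself.
si_snr_improvement = si_snr(reference, estimate) - si_snr(reference, mixture)
```

Multiplying `estimate` by any nonzero constant leaves `si_snr` essentially unchanged, which is the sense in which the measure is scale-invariant.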
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
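The PIT criterion can be illustrated with a short sketch: the loss is evaluated for every pairing of estimated sources with reference sources, and the best pairing is kept. The function names and toy data below are hypothetical, and mean squared error is assumed as the per-source loss.

```python
import itertools


def mse(target, estimate):
    """Mean squared error between two equal-length signals."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def pit_loss(references, estimates):
    """Lowest average per-source loss over all permutations of the estimates.

    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    estimate matched to reference i.
    """
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(estimates))):
        loss = sum(mse(ref, estimates[p]) for ref, p in zip(references, perm))
        loss /= len(references)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the estimates are correct but come out in swapped order.
refs = [[1.0, 0.0], [0.0, 1.0]]
ests = [[0.0, 1.0], [1.0, 0.0]]
pit_value, pit_perm = pit_loss(refs, ests)
```

Unlike MixIT, this requires isolated reference sources, which is the supervised setting the headline paper's abstract contrasts against.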
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions, and a third test set based on VCTK for speech and WHAM! for noise is introduced.
Universal Sound Separation
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
  • Scott Wisdom, J. Hershey, R. Saurous
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This paper presents a new approach to masking that applies mixture consistency to complex-valued short-time Fourier transforms (STFTs) using real-valued masks, and shows that this approach can be effective in speech enhancement.
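One simple form of mixture consistency is a projection that spreads the residual between the mixture and the sum of the estimates equally across the estimates. The sketch below assumes this unweighted variant and, for brevity, real-valued time-domain lists rather than complex STFTs; the function name and numbers are hypothetical.

```python
def mixture_consistency(mixture, est_sources):
    """Adjust estimated sources so that they sum exactly to the mixture.

    Each source receives an equal share of the residual
    (mixture minus the sum of the estimates).
    """
    num_sources = len(est_sources)
    n = len(mixture)
    residual = [mixture[t] - sum(src[t] for src in est_sources) for t in range(n)]
    return [
        [src[t] + residual[t] / num_sources for t in range(n)]
        for src in est_sources
    ]


# Made-up example: the raw estimates do not add up to the mixture.
mixture = [1.0, 2.0, 3.0]
est = [[0.5, 1.0, 1.0], [0.0, 0.5, 1.0]]
projected = mixture_consistency(mixture, est)
```

Because the operation is a linear function of the estimates, it is differentiable and can in principle sit at the output of a separation network during training.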
Single-Channel Multi-Speaker Separation Using Deep Clustering
This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal to distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation.
Ratio and difference of $l_1$ and $l_2$ norms and sparse representation with coherent dictionaries
The mathematical theory of the sparsity promoting properties of the ratio metric in the context of basis pursuit via over-complete dictionaries is studied and sequentially convex algorithms are introduced to illustrate how the ratio and difference penalties are computed to produce both stable and sparse solutions.
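The ratio metric itself is easy to compute, and a small sketch shows why it rewards sparsity. For a vector with n entries, the ratio of the l1 norm to the l2 norm is 1 when a single entry is nonzero and sqrt(n) when all entries have equal magnitude. The function name and the `eps` stabilizer are assumptions for this illustration.

```python
import math


def l1_over_l2(values, eps=1e-12):
    """Ratio of the l1 norm to the l2 norm; smaller values mean sparser vectors."""
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    return l1 / (l2 + eps)


# One active entry gives a ratio near 1; four equal entries give a ratio near 2.
one_active = l1_over_l2([3.0, 0.0, 0.0, 0.0])
all_active = l1_over_l2([1.0, 1.0, 1.0, 1.0])
```

The ratio is unchanged when the vector is rescaled, so minimizing it encourages fewer active entries rather than simply smaller values.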