Corpus ID: 225070098

Unsupervised Sound Separation Using Mixture Invariant Training

@article{Wisdom2020UnsupervisedSS,
  title={Unsupervised Sound Separation Using Mixture Invariant Training},
  author={Scott Wisdom and Efthymios Tzinis and Hakan Erdogan and Ron J. Weiss and Kevin W. Wilson and John R. Hershey},
  journal={arXiv: Audio and Speech Processing},
  year={2020}
}
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the… 
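The abstract describes mixture invariant training (MixIT), in which the separated sources are remixed and matched back to the reference mixtures, minimizing over all assignments of sources to mixtures. A minimal NumPy sketch of the objective, assuming two reference mixtures and substituting a simple MSE for the paper's negative-SNR loss (the function name and interface here are illustrative, not the paper's code):

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mixtures):
    """MixIT loss sketch.

    est_sources: (M, T) sources separated from the sum of the mixtures.
    mixtures:    (2, T) the two reference mixtures.
    Minimizes, over all binary assignments of each source to one of the
    two mixtures, the MSE between the remixed estimates and the references.
    """
    M = est_sources.shape[0]
    best = np.inf
    # Enumerate all 2^M ways to assign sources to the two mixtures.
    for assign in itertools.product([0, 1], repeat=M):
        remix = np.zeros_like(mixtures)
        for source, idx in zip(est_sources, assign):
            remix[idx] += source
        best = min(best, float(np.mean((remix - mixtures) ** 2)))
    return best
```

Because the minimum runs over assignments rather than a fixed pairing, the model is never told which source belongs to which mixture, which is what makes training on unseparated real recordings possible.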

Citations

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

This work proposes speaker-aware mixture of mixtures training (SAMoM), which exploits the consistency of speaker identity among the target source, enrollment utterance, and target estimate to weakly supervise the training of a deep speaker extractor.

Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training

This paper investigates using MixIT to adapt a separation model to real far-field, overlapping, reverberant, and noisy speech data from the AMI Corpus, and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets.

DF-Conformer: Integrated Architecture of Conv-Tasnet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement

This study aims to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network, and extends the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers.

Reverberation as Supervision for Speech Separation

This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation, and shows that minimizing the scale-invariant signal-to-distortion ratio (SI-SDR) of the predicted right-channel mixture with respect to the ground truth implicitly guides the network towards separating the left-channel sources.

Audio Signal Enhancement with Learning from Positive and Unlabelled Data

It is observed that spectrogram patches of noise clips can be used as positive (P) data and those of noisy signal clips as unlabelled data, enabling a convolutional neural network to learn to classify each spectrogram patch as positive or negative for speech enhancement through learning from positive and unlabelled data.

Unsupervised Source Separation via Self-Supervised Training

We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth.

Unsupervised Speech Enhancement with Speech Recognition Embedding and Disentanglement Losses

  • V. Trinh, Sebastian Braun
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
The proposed unsupervised loss function is developed by extending the MixIT loss function with a speech recognition embedding and a disentanglement loss, and effectively improves speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset.

Improving Bird Classification with Unsupervised Sound Separation

Improved separation quality is demonstrated when training a MixIT model specifically for birdsong data, outperforming a general audio separation model by over 5 dB in SI-SNR improvement of reconstructed mixtures.

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

This paper introduces new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs to combat over-separation in mixture invariant training.
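The covariance loss mentioned above discourages correlated separator outputs. A hedged NumPy sketch of one such penalty, assuming it is the sum of squared off-diagonal entries of the empirical covariance between sources (the paper's exact formulation may differ):

```python
import numpy as np

def covariance_loss(est_sources):
    """Penalize correlation between separated sources (illustrative sketch).

    est_sources: (M, T) array of separated sources. Returns the sum of
    squared off-diagonal entries of the empirical source covariance, which
    is zero when all pairs of outputs are uncorrelated.
    """
    X = est_sources - est_sources.mean(axis=1, keepdims=True)
    C = X @ X.T / X.shape[1]              # (M, M) empirical covariance
    off = C - np.diag(np.diag(C))         # zero the diagonal (source powers)
    return float(np.sum(off ** 2))
```

Intuitively, an over-separating model that splits one source across two outputs leaves those outputs correlated, so this penalty pushes energy back into a single output.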

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.



References

SDR – Half-baked or Well Done?

It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
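The scale-invariant SDR (SI-SDR) advocated by this line of work projects the estimate onto the reference before measuring distortion, so the metric is invariant to gain. A minimal NumPy sketch (the mean removal and epsilon guard are common implementation choices, not mandated by the definition):

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D signals."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10((target @ target) / (noise @ noise + eps))
```

Because `alpha` absorbs any global gain, rescaling the estimate leaves the score unchanged, which avoids the gain-related abuses of plain SDR discussed above.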

Filterbank Design for End-to-end Speech Separation

The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of Conv-TasNet, validating the use of parameterized filterbanks and showing that complex-valued representations and masks are beneficial in all conditions.

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
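PIT evaluates the training loss under every ordering of the estimated sources and backpropagates through the best one, resolving the label-permutation ambiguity. A minimal NumPy sketch with an MSE objective (illustrative; actual systems use spectral or SNR-based losses):

```python
import itertools
import numpy as np

def pit_mse(est, refs):
    """Permutation invariant MSE: minimum over all source orderings.

    est, refs: (S, T) arrays of estimated and reference sources.
    """
    S = refs.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        best = min(best, float(np.mean((est[list(perm)] - refs) ** 2)))
    return best
```

Note the factorial cost in the number of sources; MixIT's mixture-level assignment (above) can be seen as the same idea applied to mixtures rather than isolated references.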

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.

Deep clustering: Discriminative embeddings for segmentation and separation

Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Universal Sound Separation

A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.

What’s all the Fuss about Free Universal Sound Separation Data?

  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Computer Science, Physics
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), which achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.

FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced: an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions; a third test set, based on VCTK for speech and WHAM! for noise, is also introduced.