Corpus ID: 225070098

Unsupervised Sound Separation Using Mixture Invariant Training

@article{Wisdom2020UnsupervisedSS,
  title={Unsupervised Sound Separation Using Mixture Invariant Training},
  author={Scott Wisdom and Efthymios Tzinis and Hakan Erdogan and Ron J. Weiss and Kevin W. Wilson and John R. Hershey},
  journal={arXiv: Audio and Speech Processing},
  year={2020}
}
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the… 
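
The mixture invariant training (MixIT) idea named in the title trains the separation model on a mixture of two reference mixtures: the model's estimated sources are remixed under the best binary assignment to the two references, and the loss is computed against those references. Below is a minimal NumPy sketch of this objective; the negative-SNR signal loss and the brute-force search over assignments are simplifying assumptions for illustration, not the paper's exact recipe.

```python
# Minimal NumPy sketch of a MixIT-style loss for two reference mixtures.
# The negative-SNR signal loss and the exhaustive search over binary
# assignments are illustrative assumptions.
import itertools
import numpy as np

def neg_snr(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better)."""
    signal_power = np.sum(reference ** 2) + eps
    error_power = np.sum((reference - estimate) ** 2) + eps
    return -10.0 * np.log10(signal_power / error_power)

def mixit_loss(ref_mixtures, est_sources):
    """ref_mixtures: (2, T) array holding the two reference mixtures x1, x2.
    est_sources: (M, T) sources separated from the mixture of mixtures x1 + x2.
    Returns the loss under the best binary assignment of sources to mixtures."""
    num_sources = est_sources.shape[0]
    best = np.inf
    # Each estimated source is assigned to exactly one of the two mixtures.
    for assignment in itertools.product([0, 1], repeat=num_sources):
        remixed = np.zeros_like(ref_mixtures)
        for src_idx, mix_idx in enumerate(assignment):
            remixed[mix_idx] += est_sources[src_idx]
        loss = sum(neg_snr(ref_mixtures[i], remixed[i]) for i in range(2))
        best = min(best, loss)
    return best
```

With four estimated sources, for example, the search covers 2**4 = 16 candidate assignments; training then amounts to minimizing this loss over batches of mixtures of mixtures.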

Citations

Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
TLDR
This paper investigates using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvement and the highest PESQ scores and human listening ratings across synthetic and real datasets.
DF-Conformer: Integrated Architecture of Conv-Tasnet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement
TLDR
This study aims to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network, and extends the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers.
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
TLDR
This work proposes speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker extractor.
Unsupervised Source Separation via Self-Supervised Training
We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth…
Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses
TLDR
The proposed unsupervised loss function extends the MixIT loss with a speech recognition embedding and a disentanglement loss, and it effectively improves speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset.
Improving Bird Classification with Unsupervised Sound Separation
TLDR
Improved separation quality is demonstrated when training a MixIT model specifically for birdsong data, outperforming a general audio separation model by over 5 dB in SI-SNR improvement of reconstructed mixtures.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
TLDR
This paper introduces new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs to combat over-separation in mixture invariant training.
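
The summary above names two regularizers used to combat over-separation. The sketch below shows one plausible form of each, a sparsity penalty over per-source energies and an off-diagonal covariance penalty over the outputs; these are illustrative assumptions, and the paper's exact formulations may differ.

```python
# Illustrative sketch of the two regularizer families named above; these are
# plausible forms, not the paper's exact losses.
import numpy as np

def sparsity_penalty(est_sources, eps=1e-8):
    """Favor concentrating energy in fewer sources (L1/L2 ratio of RMS energies).
    est_sources: (M, T). Smaller when few sources carry most of the energy."""
    energies = np.sqrt(np.mean(est_sources ** 2, axis=1) + eps)  # per-source RMS
    return np.sum(energies) / (np.sqrt(np.sum(energies ** 2)) + eps)

def covariance_penalty(est_sources):
    """Discourage correlated outputs by penalizing off-diagonal covariance.
    est_sources: (M, T)."""
    centered = est_sources - est_sources.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / est_sources.shape[1]  # (M, M) covariance
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2)
```
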
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
TLDR
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data
TLDR
This work proposes FedEnhance, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients and shows that it achieves competitive enhancement performance compared to IID training on a single device.
AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries
TLDR
This paper proposes a neural network that performs audio transformations on user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description, and shows that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.
…

References

SHOWING 1-10 OF 52 REFERENCES
SDR – Half-baked or Well Done?
TLDR
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
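
This reference motivates the scale-invariant variant of SDR (SI-SDR, often reported as SI-SNR), which is the metric several results quoted on this page use. As a reminder of the usual definition, here is a small self-contained sketch; the variable names are illustrative.

```python
# Small sketch of scale-invariant SNR (SI-SNR / SI-SDR) as commonly defined;
# variable names are illustrative, not taken from BSS_eval.
import numpy as np

def si_snr(reference, estimate, eps=1e-8):
    """SI-SNR in dB between a reference source and its estimate (1-D arrays)."""
    reference = reference - reference.mean()  # zero-mean, as is conventional
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) /
                           (np.dot(noise, noise) + eps))
```

The improvement variants (SI-SNRi) quoted in several summaries above are the difference between the SI-SNR of the estimate and the SI-SNR of the unprocessed input mixture.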
Filterbank Design for End-to-end Speech Separation
TLDR
The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of Conv-TasNet, validate the use of parameterized filterbanks, and show that complex-valued representations and masks are beneficial in all conditions.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well to unseen speakers and languages.
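
As a supervised counterpart to the MixIT sketch earlier on this page, the PIT criterion named in this reference evaluates a signal-level loss under every permutation of estimated sources against reference sources and keeps the best one. A compact sketch, using mean squared error purely for illustration:

```python
# Compact sketch of a PIT-style criterion: evaluate a signal loss under every
# permutation of estimated sources and keep the best one. Mean squared error
# is used here purely for illustration.
import itertools
import numpy as np

def pit_mse_loss(ref_sources, est_sources):
    """ref_sources, est_sources: (N, T) arrays holding N sources each."""
    num_sources = ref_sources.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(num_sources)):
        loss = sum(np.mean((ref_sources[i] - est_sources[perm[i]]) ** 2)
                   for i in range(num_sources))
        best = min(best, loss)
    return best
```
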
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
TLDR
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.
Librispeech: An ASR corpus based on public domain audio books
TLDR
It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Universal Sound Separation
TLDR
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
What’s all the Fuss about Free Universal Sound Separation Data?
  Scott Wisdom, Hakan Erdogan, John R. Hershey. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), which achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
TLDR
The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions; a third test set based on VCTK for speech and WHAM! for noise is also introduced.
…