Unsupervised Sound Separation Using Mixture Invariant Training
@article{Wisdom2020UnsupervisedSS,
  title   = {Unsupervised Sound Separation Using Mixture Invariant Training},
  author  = {Scott Wisdom and Efthymios Tzinis and Hakan Erdogan and Ron J. Weiss and Kevin W. Wilson and John R. Hershey},
  journal = {arXiv: Audio and Speech Processing},
  year    = {2020}
}
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the…
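In mixture invariant training (MixIT), a training example is formed by summing two existing mixtures; the model emits up to M latent sources, and the loss is minimized over all assignments of the estimated sources to the two input mixtures, so that the remixed estimates reconstruct the originals. A minimal NumPy sketch of this assignment search (illustrative only: the function name is hypothetical, and the paper optimizes a negative-SNR loss rather than the plain MSE used here):

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Illustrative MixIT loss: remix the M estimated sources into two groups
    and score the remixes against the two input mixtures, keeping the best of
    all 2^M binary assignments. est_sources: (M, T); mix1, mix2: (T,)."""
    num_sources = est_sources.shape[0]
    best = np.inf
    # Each estimated source is assigned to exactly one of the two mixtures.
    for assignment in itertools.product((0, 1), repeat=num_sources):
        remix = np.zeros((2, est_sources.shape[1]))
        for source, a in zip(est_sources, assignment):
            remix[a] += source
        # Plain MSE for brevity; the paper uses an SNR-based loss instead.
        loss = np.mean((remix[0] - mix1) ** 2) + np.mean((remix[1] - mix2) ** 2)
        best = min(best, loss)
    return best
```

Because supervision comes only from reconstructing the input mixtures, no isolated ground-truth sources are needed, which sidesteps the synthetic-data mismatch described in the abstract.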
38 Citations
Improving Bird Classification with Unsupervised Sound Separation
- Computer Science
- 2022
Training a MixIT model specifically on birdsong data is demonstrated to improve separation quality, outperforming a general audio separation model by over 5 dB in SI-SNR improvement of reconstructed mixtures.
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
- Computer Science
- ArXiv
- 2022
This work proposes speaker-aware mixture of mixtures training (SAMoM), which exploits the consistency of speaker identity among the target source, the enrollment utterance, and the target estimate to weakly supervise the training of a deep speaker extractor.
Unsupervised Source Separation via Self-Supervised Training
- Computer Science
- ArXiv
- 2022
We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth…
Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses
- Computer Science
- ArXiv
- 2021
The proposed unsupervised loss function extends the MixIT loss with a speech recognition embedding and a disentanglement loss, and effectively improves speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset.
AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries
- Computer Science
- ACM Multimedia
- 2021
This paper proposes a neural network that applies audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a textual description, while preserving the sources not mentioned in the description, and shows that AMSS-Net outperforms baselines on several AMSS tasks in both objective metrics and empirical verification.
Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data
- Computer Science
- 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
This work proposes FedEnhance, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID data distributed across multiple clients, and shows that it achieves enhancement performance competitive with IID training on a single device.
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
- Computer Science
- ArXiv
- 2021
This paper investigates using MixIT to adapt a separation model to real far-field, overlapping, reverberant, and noisy speech data from the AMI Corpus, and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets.
DF-Conformer: Integrated Architecture of Conv-TasNet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement
- Computer Science
- 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
This study aims to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network, and extends the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers.
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
- Computer Science
- ICLR
- 2021
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100M video data.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
- Computer Science
- 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
To combat over-separation in mixture invariant training, this paper introduces new losses: sparsity losses that favor fewer output sources, and a covariance loss that discourages correlated outputs.
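As a rough sketch of the covariance idea (a hypothetical formulation; the paper's exact losses are not reproduced here), one can penalize the mean absolute off-diagonal correlation between the estimated sources:

```python
import numpy as np

def covariance_penalty(est_sources, eps=1e-8):
    """Hypothetical illustration of a covariance-style penalty: the mean
    absolute off-diagonal entry of the correlation matrix across estimated
    sources of shape (M, T). Correlated outputs raise the penalty."""
    s = est_sources - est_sources.mean(axis=1, keepdims=True)
    cov = s @ s.T / s.shape[1]
    std = np.sqrt(np.diag(cov)) + eps
    corr = cov / np.outer(std, std)
    off_diag = corr - np.diag(np.diag(corr))
    return np.mean(np.abs(off_diag))
```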
References
Showing 1-10 of 52 references
Filterbank Design for End-to-end Speech Separation
- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of Conv-TasNet, validate the use of parameterized filterbanks, and show that complex-valued representations and masks are beneficial in all conditions.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
- Computer Science
- 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and deep clustering (DPCL), and generalizes well to unseen speakers and languages.
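The PIT criterion itself is compact: evaluate the training loss under every pairing of model outputs with reference sources and keep only the best one. A minimal NumPy sketch using MSE as the per-source loss (the original work applies the criterion to masked spectrograms; the function name is illustrative):

```python
import itertools
import numpy as np

def pit_loss(est, ref):
    """Permutation invariant training: compute the loss for every permutation
    of the estimated sources against the references and return the minimum.
    est, ref: arrays of shape (num_sources, T)."""
    num_sources = ref.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(num_sources)):
        loss = np.mean((est[list(perm)] - ref) ** 2)  # MSE under this pairing
        best = min(best, loss)
    return best
```

The factorial search over pairings is cheap for the two or three sources typical in talker separation.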
Librispeech: An ASR corpus based on public domain audio books
- Computer Science
- 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
SDR – Half-baked or Well Done?
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
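The scale-invariant SDR (SI-SDR) advocated by this paper is simple to state: rescale the reference to best explain the estimate, then measure the energy ratio between that projection and the residual. A minimal single-channel sketch, assuming 1-D NumPy arrays (this is not the BSS_eval implementation):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB. The estimate is compared against an
    optimally scaled copy of the reference, so rescaling the estimate
    leaves the metric unchanged."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference    # part of the estimate explained by the reference
    residual = estimate - target  # everything else counts as distortion
    return 10.0 * np.log10((np.sum(target ** 2) + eps) /
                           (np.sum(residual ** 2) + eps))
```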
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
- Computer Science
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Deep clustering: Discriminative embeddings for segmentation and separation
- Computer Science
- 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and that the same model does surprisingly well on three-speaker mixtures.
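Behind these results is the deep clustering objective: learn a unit-norm embedding for each time-frequency bin so that the embedding affinity matrix V Vᵀ matches the ideal affinity Y Yᵀ given by the ground-truth source assignments. A short NumPy sketch in the usual low-rank form (variable and function names are illustrative):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective |V V^T - Y Y^T|_F^2, computed as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 to avoid forming the
    N x N affinity matrix. V: (N, D) unit-norm embeddings, one per
    time-frequency bin; Y: (N, C) one-hot ideal source assignments."""
    a = np.sum((V.T @ V) ** 2)
    b = np.sum((V.T @ Y) ** 2)
    c = np.sum((Y.T @ Y) ** 2)
    return a - 2.0 * b + c
```

At test time the rows of V are clustered, e.g. with k-means, and the cluster labels become binary separation masks.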
Universal Sound Separation
- Computer Science
- 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2019
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
FSD50K: An Open Dataset of Human-Labeled Sound Events
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2022
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
Wavesplit: End-to-End Speech Separation by Speaker Clustering
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
What’s all the Fuss about Free Universal Sound Separation Data?
- Computer Science, Physics
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced; based on an improved time-domain convolutional network (TDCN++), it achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.