Corpus ID: 219980350

Unsupervised Sound Separation Using Mixtures of Mixtures

@article{Wisdom2020UnsupervisedSS,
  title={Unsupervised Sound Separation Using Mixtures of Mixtures},
  author={Scott Wisdom and Efthymios Tzinis and Hakan Erdogan and Ron J. Weiss and Kevin W. Wilson and John R. Hershey},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.12701}
}
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, the model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. The reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of… 
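As a rough illustration of the paper's mixture invariant training (MixIT) idea, the sketch below sums two unlabeled reference mixtures into a mixture of mixtures, separates it into several estimated sources, and keeps the lowest reconstruction error over all binary mixing matrices that reassign the estimates back to the two references. The mean-squared-error objective, the number of estimated sources, and the random stand-ins for the network output are placeholder assumptions, not the authors' implementation (the paper uses an SNR-based loss).

```python
import itertools
import numpy as np

def mixit_loss(refs, est_sources):
    """Mixture invariant training (MixIT) loss, minimal sketch.

    refs:        (2, T) array of reference mixtures x1, x2.
    est_sources: (M, T) array of sources estimated from the sum x1 + x2.

    Each estimated source is assigned to exactly one reference mixture,
    and the loss is the best (lowest) reconstruction error over all
    2**M binary mixing matrices A.
    """
    num_refs, num_est = refs.shape[0], est_sources.shape[0]
    best = np.inf
    # Enumerate every assignment of the M estimates to the 2 references.
    for assignment in itertools.product(range(num_refs), repeat=num_est):
        A = np.zeros((num_refs, num_est))
        A[assignment, range(num_est)] = 1.0
        remixed = A @ est_sources  # (2, T) remixed estimates
        # Mean-squared error stands in for the paper's SNR-based loss.
        best = min(best, np.mean((refs - remixed) ** 2))
    return best

# Usage: refs are two real, unlabeled mixtures; est_sources stands in for
# the separation network's output on refs.sum(axis=0).
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 16000))
est_sources = rng.standard_normal((4, 16000))
print(mixit_loss(refs, est_sources))
```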
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, +6 authors J. Hershey
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), that achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
TLDR
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Training Speech Enhancement Systems with Noisy Speech Datasets
TLDR
This paper proposes several modifications of the loss functions that make them robust against noisy speech targets, and proposes a noise augmentation scheme for mixture-invariant training (MixIT) that allows MixIT to be used in such scenarios as well.
Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement
TLDR
This work addresses the few-shot learning scenario in which clean recordings of a test-time speaker are limited to a few seconds but noisy recordings of the speaker are abundant, and develops a simple contrastive learning procedure that treats the abundant noisy data as makeshift training targets.
Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn’s Algorithm
  • Hideyuki Tachibana
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
SinkPIT, a novel variant of the PIT loss based on Sinkhorn's matrix balancing algorithm, is proposed; it efficiently finds a doubly stochastic matrix that approximates the best permutation in a differentiable manner, and is much more efficient than the ordinary PIT loss when the number of sources N is large.
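To make the Sinkhorn step described above concrete, here is a minimal sketch (not the paper's code) of the matrix balancing iteration: a pairwise score matrix is alternately normalized over rows and columns in log space until it is approximately doubly stochastic, giving a differentiable soft approximation to the best permutation. The temperature and iteration count are arbitrary placeholder choices.

```python
import numpy as np

def sinkhorn(scores, n_iters=50, temperature=0.1):
    """Balance an (N, N) score matrix into an approximately doubly
    stochastic matrix via Sinkhorn iterations.

    scores: pairwise similarity between estimated and reference sources
    (e.g. negative pairwise losses).
    """
    log_p = scores / temperature
    for _ in range(n_iters):
        # Normalize rows, then columns, in log space for stability.
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)

# Usage: weight the N x N pairwise losses by the balanced matrix instead
# of searching over all N! permutations.
P = sinkhorn(-np.random.rand(10, 10))
print(P.sum(axis=0).round(3), P.sum(axis=1).round(3))  # both close to 1
```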
Distortion-Controlled Training for end-to-end Reverberant Speech Separation with Auxiliary Autoencoding Loss
TLDR
This paper introduces the "equal-valued contour" problem in reverberant separation, where multiple outputs can lead to the same performance as measured by common metrics, and investigates how "better" outputs with lower target-specific distortions can be selected by auxiliary autoencoding training (A2T).
Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems
TLDR
This paper proposes a semi-supervised approach for speech enhancement in which a modified vector-quantized variational autoencoder is trained that solves a source separation task and is used to further train an enhancement network using real-world noisy speech data by computing a triplet-based unsupervised loss function.
Visual Scene Graphs for Audio Source Separation
TLDR
An “in the wild” video dataset for sound source separation that contains multiple non-musical sources is introduced; adapted from the AudioCaps dataset, it provides a challenging, natural, daily-life setting for source separation.
Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording
TLDR
This work applies the SSUSI model to long recordings and proposes a self-informed, clustering-based inventory-forming scheme in which the speaker inventory is built entirely from the input signal, without the need for external speaker signals.
Unified Gradient Reweighting for Model Biasing with Applications to Source Separation
TLDR
A simple, unified gradient reweighting scheme is proposed that, with a lightweight modification, biases the learning process of a model and steers it towards a certain distribution of results, using a user-specified probability distribution.

References

SHOWING 1-10 OF 52 REFERENCES
Bootstrapping Single-channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures
TLDR
The idea is to use simple, low-level processing to separate sources in an unsupervised fashion, identify easy conditions, and then use that knowledge to bootstrap a (self-)supervised source separation model for difficult conditions.
Improving Universal Sound Separation Using Sound Classification
TLDR
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, +6 authors J. Hershey
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), that achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
Universal Sound Separation
TLDR
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Mixup-breakdown: A Consistency Training Method for Improving Generalization of Speech Separation Models
  • Max W. Y. Lam, J. Wang, Dan Su, Dong Yu
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
MBT is evaluated under various conditions with increasing degrees of mismatch, including unseen interfering speech, noise, and music, and the results indicate that MBT significantly outperforms several strong baselines, with up to 13.77% relative SI-SNRi improvement.
Unsupervised Deep Clustering for Source Separation: Direct Learning from Mixtures Using Spatial Information
TLDR
A deep clustering approach is used that trains on multichannel mixtures and learns to project spectrogram bins to source clusters that correlate with various spatial features; the resulting system is shown to be capable of performing sound separation on monophonic inputs, despite having learned to do so from multichannel recordings.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and deep clustering (DPCL), and generalizes well to unseen speakers and languages.
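A minimal sketch of the permutation invariant training idea, using mean-squared error as a placeholder for the paper's training criterion: the loss is computed for every permutation of the estimated sources and the smallest value is kept, so the network is not penalized for emitting the sources in a different order.

```python
import itertools
import numpy as np

def pit_loss(refs, ests):
    """Permutation invariant training (PIT) loss, minimal sketch.

    refs, ests: (S, T) arrays of reference and estimated sources.
    """
    num_srcs = refs.shape[0]
    losses = [np.mean((refs - ests[list(perm)]) ** 2)
              for perm in itertools.permutations(range(num_srcs))]
    return min(losses)

refs = np.random.randn(3, 16000)  # three reference speakers
ests = refs[[2, 0, 1]]            # estimates in a different order
print(pit_loss(refs, ests))       # ~0: the best permutation recovers the match
```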
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well on three-speaker mixtures.
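For context, the deep clustering objective referred to here is commonly written as the Frobenius-norm affinity loss ||V V^T - Y Y^T||_F^2 between embedding affinities and ideal-assignment affinities; the sketch below uses that standard formulation with random placeholder inputs, not the authors' code.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Affinity loss ||V V^T - Y Y^T||_F^2, expanded so the large
    (TF x TF) affinity matrices are never formed explicitly.

    V: (TF, D) unit-norm embeddings, one per time-frequency bin.
    Y: (TF, S) one-hot ideal assignments of bins to sources.
    """
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)

# Placeholder inputs: 1000 time-frequency bins, 20-dim embeddings, 2 sources.
TF, D, S = 1000, 20, 2
V = np.random.randn(TF, D)
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(S)[np.random.randint(0, S, size=TF)]
print(deep_clustering_loss(V, Y))
```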
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science, Engineering
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
Speech separation using speaker-adapted eigenvoice speech models
TLDR
An algorithm to infer the characteristics of the sources present in a mixture is presented, allowing for significantly improved separation performance over that obtained using unadapted source models.