Training Noisy Single-Channel Speech Separation with Noisy Oracle Sources: A Large Gap and a Small Step

@article{Maciejewski2021TrainingNS,
  title={Training Noisy Single-Channel Speech Separation with Noisy Oracle Sources: A Large Gap and a Small Step},
  author={Matthew Maciejewski and Jing Shi and Shinji Watanabe and Sanjeev Khudanpur},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={5774-5778}
}
As the performance of single-channel speech separation systems has improved, there has been a desire to move to more challenging conditions than the clean, near-field speech that initial systems were developed on. When training deep learning separation models, a need for ground truth leads to training on synthetic mixtures. As such, training in noisy conditions requires either using noise synthetically added to clean speech, preventing the use of in-domain data for a noisy-condition task, or… 
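The synthetic-mixture training setup the abstract refers to can be illustrated concretely. The following is a hedged NumPy sketch (not code from the paper): a helper that scales a noise signal so that, when added to a speech signal, the mixture has a requested signal-to-noise ratio, returning both the mixture and its oracle sources for supervised training.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix `speech` and `noise` (1-D arrays of equal length) at `snr_db`.

    Returns (mixture, speech, scaled_noise), i.e. the synthetic mixture
    together with the oracle source signals used as training targets.
    """
    p_speech = np.mean(speech ** 2)           # average speech power
    p_noise = np.mean(noise ** 2)             # average noise power
    # Gain that brings the noise to the requested SNR relative to speech
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, speech, scaled_noise
```

Datasets such as WHAM! are built from exactly this kind of additive construction, which is what makes oracle sources available at training time.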

RoSS: Utilizing Robotic Rotation for Audio Source Separation

The Rotational Source Separation (RoSS) module could be plugged into actual robot heads, or into other devices capable of rotation, and experiments show that its gains translate well to practice provided two mobility-related challenges can be mitigated.

Efficient Personalized Speech Enhancement Through Self-Supervised Learning

This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers, …

Speaker Verification-Based Evaluation of Single-Channel Speech Separation

This work explores the value of speaker verification as an extrinsic metric of separation quality, with additional utility as evidence of the benefits of separation as pre-processing for downstream tasks.

Training Speech Enhancement Systems with Noisy Speech Datasets

This paper proposes several modifications of the loss functions that make them robust against noisy speech targets, along with a noise augmentation scheme for mixture invariant training (MixIT) that makes it applicable in such scenarios as well.

References

Showing 1–10 of 29 references

WHAM!: Extending Speech Separation to Noisy Environments

The WSJ0 Hipster Ambient Mixtures dataset is created, consisting of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples, to benchmark various speech separation architectures and objective functions to evaluate their robustness to noise.

Analysis of Robustness of Deep Single-Channel Speech Separation Using Corpora Constructed From Multiple Domains

Investigating the robustness of single-channel speech separation techniques in more realistic environments with multiple and diverse conditions finds that both matched and multi-condition training have significant gaps from the oracle performance in far-field conditions, which advocates a need for extending existing separation techniques to deal with far-field/highly-reverberant speech mixtures.

WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

WHAMR!, an augmented version of WHAM! with synthetically reverberated sources, is introduced, along with a thorough baseline analysis of current techniques and novel cascaded architectures on the newly introduced conditions.

Unsupervised Sound Separation Using Mixtures of Mixtures

This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
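As a concrete illustration of the MixIT idea, here is a minimal NumPy sketch (not the authors' implementation): given M estimated sources produced from a mixture of two mixtures, it enumerates every binary assignment of outputs to the two reference mixtures and returns the best remixing score. A plain MSE stands in for the SNR-based loss used in the paper.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, ref_mixtures):
    """Mixture invariant training loss (illustrative sketch).

    est_sources:  (M, T) separated outputs for the mixture of mixtures
    ref_mixtures: (2, T) the two reference mixtures that were summed
    Searches all 2**M assignments of outputs to references and returns
    the smallest mean-squared error between the remixed estimates and
    the references (MixIT proper uses an SNR-based loss here).
    """
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        # Sum the estimated sources assigned to each reference mixture
        remix = np.stack([
            est_sources[[m for m in range(M) if assign[m] == r]].sum(axis=0)
            for r in (0, 1)
        ])
        best = min(best, np.mean((remix - ref_mixtures) ** 2))
    return best
```

Because only mixtures are required as references, no clean oracle sources are needed, which is what makes MixIT attractive for the in-domain noisy training problem this paper studies.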

TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.

Deep CASA for Talker-Independent Monaural Speech Separation

This study addresses both speech and nonspeech interference, i.e., monaural speaker separation in noise, in a talker-independent fashion, and extends a recently proposed deep CASA system to deal with noisy speaker mixtures to facilitate speech enhancement.

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions, and a third test set based on VCTK for speech and WHAM! for noise is introduced.

Universal Sound Separation

A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
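The scale-invariant signal-to-distortion ratio (SI-SDR) quoted above is a standard separation metric. A minimal NumPy implementation of its common definition might look like the following: the estimate is projected onto the reference so that gain differences do not affect the score, and the projected target energy is compared to the residual energy.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    Projects `estimate` onto `reference` (optimal scaling), then reports
    10*log10(target energy / residual energy). Higher is better.
    """
    # Optimal scaling factor aligning the reference with the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference            # scaled reference component
    residual = estimate - target          # everything not explained by it
    return 10 * np.log10(
        (np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps)
    )
```

Scale invariance matters because separation networks have no way to recover the absolute level of each source; rescaling the estimate leaves the metric unchanged.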

Wavesplit: End-to-End Speech Separation by Speaker Clustering

Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.

The Second 'CHiME' Speech Separation and Recognition Challenge: Datasets, Tasks and Baselines

This paper is intended to be a reference on the 2nd 'CHiME' Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment.