Continual Self-Training With Bootstrapped Remixing For Speech Enhancement

@inproceedings{tzinis2022continual,
  title={Continual Self-Training With Bootstrapped Remixing For Speech Enhancement},
  author={Efthymios Tzinis and Yossi Adi and Vamsi Krishna Ithapu and Buye Xu and Anurag Kumar},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022}
}
  • Published 19 October 2021
  • Computer Science
We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuous self-training scheme that overcomes limitations of previous studies, including assumptions about the in-domain noise distribution and the need for access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap…
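The teacher-then-remix step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `teacher_separate` is a hypothetical callable standing in for the pre-trained out-of-domain teacher, and the batches are assumed to be single-channel waveforms of shape `(batch, samples)`.

```python
import numpy as np

def remixit_step(teacher_separate, mixtures, rng):
    """One bootstrapped-remixing step (sketch).

    teacher_separate: hypothetical callable mapping a batch of mixtures
        to (speech_est, noise_est), each of shape (batch, samples).
    mixtures: in-domain noisy mixtures, shape (batch, samples).
    """
    speech_est, noise_est = teacher_separate(mixtures)
    # Shuffle the teacher's noise estimates across the batch and remix them
    # with the speech estimates to create new bootstrapped training mixtures.
    perm = rng.permutation(len(mixtures))
    new_mixtures = speech_est + noise_est[perm]
    # A student model would then be trained to recover the teacher's
    # pseudo-targets (speech_est, noise_est[perm]) from new_mixtures.
    return new_mixtures, speech_est, noise_est[perm]
```

Because the remixed input is by construction the sum of the two pseudo-targets, the student always sees mixtures that are consistent with its training labels, even though no clean in-domain signals are ever used.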


Self-Remixing: Unsupervised Speech Separation via Separation and Remixing

Self-Remixing performs better than existing remixing-based self-supervised methods at the same or lower training cost in the unsupervised setup, and outperforms baselines in semi-supervised domain adaptation.

On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training

An improved framework for training a monaural neural enhancement model for robust speech recognition is explored, extending the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.

Unsupervised Source Separation via Self-Supervised Training

We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth.

Music Source Separation with Band-split RNN

BSRNN is proposed, a frequency-domain model that explicitly splits the spectrogram of the mixture into subbands and performs interleaved band-level and sequence-level modeling; a semi-supervised model fine-tuning pipeline that can further improve the model's performance is also described.

A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement

A systematic comparison between different methods of incorporating phonetic information into a speech enhancement model is conducted, suggesting that using an SSL model to extract phonetic features outperforms using an ASR model in most cases.

Semi-supervised Time Domain Target Speaker Extraction with Attention

This work investigates a two-stage procedure for training the model on mixtures without reference signals, starting from a pre-trained supervised model, and shows that the proposed semi-supervised learning procedure improves on the supervised baselines.