Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification

Aswin Sivaraman, Sunwoo Kim, and Minje Kim
Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, one may train a personalized speech enhancement model using self-supervised learning. One straightforward approach to model personalization is to use the target speaker’s noisy recordings as pseudo-sources. Then, a pseudo denoising model learns to… 
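The pseudo-source idea described above can be sketched in a few lines: an unlabeled noisy recording from the target user is treated as the pseudo-target, and additional noise is mixed in to form a training pair, so no clean speech is ever required. The helper below is our own illustrative sketch (the function name, SNR convention, and noise-pool structure are assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pseudo_pair(noisy_speech, noise_pool, snr_db=5.0):
    """Build an (input, target) pair for pseudo denoising: the unlabeled
    noisy recording serves as the pseudo-source, and extra noise from a
    pool is added at the requested SNR. No clean speech is needed."""
    noise = noise_pool[rng.integers(len(noise_pool))][: len(noisy_speech)]
    src_pow = np.mean(noisy_speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-8
    # scale the added noise so the pseudo-source sits at snr_db above it
    gain = np.sqrt(src_pow / (noise_pow * 10 ** (snr_db / 10)))
    contaminated = noisy_speech + gain * noise
    return contaminated, noisy_speech  # model input, pseudo-target
```

A model trained on such pairs learns a "pseudo" denoising function whose target still contains the original recording's noise, which is why the paper pairs this augmentation with data purification.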

Self-Supervised Learning With Segmental Masking for Speech Representation

This work explores a novel segmental masking strategy that implicitly incorporates properties of a spoken language, such as phonotactic constraints and the duration of phonetic segments, into pre-training, and consistently outperforms its frame-based masking counterpart.

Semi-supervised Time Domain Target Speaker Extraction with Attention

This work investigates a two-stage procedure that trains the model on mixtures without reference signals, starting from a pre-trained supervised model, and shows that the proposed semi-supervised learning procedure improves the performance of the supervised baselines.

Boosting Self-Supervised Embeddings for Speech Enhancement

A cross-domain feature is used to address the problem that SSL embeddings may lack fine-grained information needed to regenerate speech signals, and the work demonstrates that integrating the SSL representation with the spectrogram can outperform state-of-the-art SSL-based SE methods in PESQ, CSIG, and COVL without invoking complicated network architectures.

Audio Self-supervised Learning: A Survey

An overview of the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain are summarized.

RemixIT: Continual Self-Training of Speech Enhancement Models via Bootstrapped Remixing

Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of the method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task.

Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training

This paper investigates using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus and finds that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets.
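The MixIT objective used in the entry above is simple enough to state concretely: the model separates a mixture of mixtures, and each estimated source is assigned to one of the two reference mixtures, taking the assignment with the lowest reconstruction error. The brute-force version below is our own minimal illustration (exponential in the number of sources, fine for small counts), not the authors' implementation:

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Mixture invariant training (MixIT) loss, brute-force sketch:
    search over all binary assignments of estimated sources to the two
    reference mixtures and keep the minimum mean-squared error."""
    best = np.inf
    for assign in itertools.product((0, 1), repeat=len(est_sources)):
        remix1 = np.zeros_like(mix1)
        remix2 = np.zeros_like(mix2)
        for src, a in zip(est_sources, assign):
            if a == 0:
                remix1 = remix1 + src
            else:
                remix2 = remix2 + src
        err = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, err)
    return best
```

Because only the mixtures are needed as references, this loss can be computed on unlabeled real recordings, which is what enables the domain adaptation described above.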

Continual Self-Training With Bootstrapped Remixing For Speech Enhancement

The proposed RemixIT method provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.
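The bootstrapped remixing step at the heart of RemixIT can be sketched compactly: a teacher model separates each noisy mixture into speech and noise estimates, the noise estimates are shuffled across the batch and added back to the speech estimates, and the student is trained to map these remixed inputs to the teacher's speech estimates. The helper below is a hypothetical sketch of ours, not the authors' code:

```python
import numpy as np

def bootstrapped_remix(est_speech, est_noise, rng):
    """RemixIT-style bootstrapped remixing (sketch): permute the
    teacher's noise estimates across the batch and add them back to the
    teacher's speech estimates, yielding new student inputs whose
    pseudo-targets are the teacher's speech estimates."""
    perm = rng.permutation(len(est_noise))
    student_inputs = est_speech + est_noise[perm]
    return student_inputs, est_speech  # student input, pseudo-target
```

Repeating this with a periodically updated teacher gives the continual self-training loop described above.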

Personalized Speech Enhancement: New Models and Comprehensive Evaluation

The results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and that multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.

Efficient Personalized Speech Enhancement Through Self-Supervised Learning

This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers, …

Self-Supervised Learning based Monaural Speech Enhancement with Complex-Cycle-Consistent

Both ablation and comparison experimental results show that the proposed self-supervised learning based monaural speech enhancement method clearly outperforms the state-of-the-art approaches.

Self-supervised Learning for Speech Enhancement

This work uses a limited training set of clean speech sounds and autoencodes speech mixtures recorded in noisy environments, training the resulting autoencoder to share a latent representation with the clean examples; it shows that noisy speech can be mapped to its clean version using a network that is trainable autonomously, without labeled training examples or human intervention.

Sparse Mixture of Local Experts for Efficient Speech Enhancement

By splitting up the speech denoising task into non-overlapping subproblems and introducing a classifier, this work is able to improve denoising performance while also reducing computational complexity.

Long short-term memory for speaker generalization in supervised speech separation.

A separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech and which substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility.

SEGAN: Speech Enhancement Generative Adversarial Network

This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; it incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

A joint framework combining speech enhancement (SE) and voice activity detection (VAD) is proposed to increase speech intelligibility in low signal-to-noise-ratio (SNR) environments, and it is demonstrated that the proposed SE approach effectively improves short-time objective intelligibility (STOI).

SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement

CNNs with the two proposed SNR-aware algorithms outperform their deep neural network counterpart in terms of standardized objective evaluations when using the same number of layers and nodes, suggesting promising generalization capability for real-world applications.

Supervised Speech Separation Based on Deep Learning: An Overview

This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years, and provides a historical perspective on how advances are made.

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

It is shown that a loss function based on the scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, suggesting that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.
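The SI-SDR objective mentioned above is compact enough to write out. Below is a minimal NumPy sketch (our own, not the paper's code): both signals are zero-meaned, the target is rescaled by the optimal gain so the metric ignores the estimate's overall scale, and the negative value serves as the training loss.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is
    better). The target is projected onto the estimate with the optimal
    gain, making the metric invariant to rescaling of the estimate."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    noise = estimate - projection
    return 10.0 * np.log10((projection @ projection + eps) / (noise @ noise + eps))

def si_sdr_loss(estimate, target):
    """Negative SI-SDR, suitable as a minimization objective."""
    return -si_sdr(estimate, target)
```

The scale invariance is easy to verify: multiplying the estimate by any nonzero constant leaves the score (essentially) unchanged, since both the projection and the residual scale together.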