A Flow-Based Neural Network for Time Domain Speech Enhancement

  • Martin Strauss, Bernd Edler
  • Published 2021
  • Engineering, Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Speech enhancement involves the distinction of a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarce, despite their success in related fields. Thus, in this paper we propose an NF framework to directly model the enhancement process by density estimation of clean speech utterances conditioned on…
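
The idea of density estimation of clean speech conditioned on another signal can be illustrated with a single conditional affine coupling layer, a common building block of normalizing flows. The sketch below is not the architecture from the paper: the linear-plus-tanh conditioner, the weight matrices `W_s` and `W_t`, the frame length of 8 samples, and the choice of a noisy copy of the clean frame as the conditioning signal `y` are all assumptions made for illustration. It shows how the change-of-variables formula gives an exact log-likelihood and how the layer is inverted.

```python
import numpy as np

def coupling_params(x_a, y, W_s, W_t):
    # Hypothetical conditioner: a linear map of [x_a, y] (an assumption,
    # real flows typically use deeper networks here).
    h = np.concatenate([x_a, y])
    log_s = np.tanh(W_s @ h)   # bounded log-scale for numerical stability
    t = W_t @ h                # shift
    return log_s, t

def forward(x, y, W_s, W_t):
    # Map a clean frame x to latent z, conditioned on y.
    half = x.size // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = coupling_params(x_a, y, W_s, W_t)
    z_b = x_b * np.exp(log_s) + t
    # The Jacobian is triangular, so log|det| is the sum of the log-scales.
    return np.concatenate([x_a, z_b]), log_s.sum()

def inverse(z, y, W_s, W_t):
    # Exact inverse: the first half passes through unchanged, so the
    # same scale and shift can be recomputed from it.
    half = z.size // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = coupling_params(z_a, y, W_s, W_t)
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b])

def log_likelihood(x, y, W_s, W_t):
    # log p(x | y) = log N(z; 0, I) + log|det dz/dx|
    z, log_det = forward(x, y, W_s, W_t)
    log_pz = -0.5 * (z @ z + z.size * np.log(2.0 * np.pi))
    return log_pz + log_det

# Toy data: an 8-sample "clean" frame and a noisy copy used as conditioning.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
y = x + 0.1 * rng.standard_normal(8)
W_s = 0.1 * rng.standard_normal((4, 12))
W_t = 0.1 * rng.standard_normal((4, 12))

z, log_det = forward(x, y, W_s, W_t)
x_rec = inverse(z, y, W_s, W_t)
print("reconstruction matches:", np.allclose(x, x_rec))
print("log p(x | y):", log_likelihood(x, y, W_s, W_t))
```

In a trained flow of this kind, the weights would be fitted by maximizing `log_likelihood` over pairs of clean and conditioning frames, and new samples would be drawn by passing Gaussian noise through `inverse` with the conditioning signal held fixed.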

A Study on Speech Enhancement Based on Diffusion Probabilistic Model
Experimental results show that DiffuSE performs comparably to related audio generative models on the standardized Voice Bank corpus SE task. Relative to the generally suggested full sampling schedule, the proposed supportive reverse process especially improves fast sampling, yielding better enhancement results in a few steps than the conventional full-step inference process.


Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN
This paper proposes applying Wasserstein GAN with gradient penalty and gated activation functions to the autoencoder network of SEGAN, which acts as a preprocessing step for speech synthesis.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement. The model operates at the waveform level, is trained end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.
Time-Frequency Masking-Based Speech Enhancement Using Generative Adversarial Network
The proposed system significantly improves speech quality over a recent GAN-based speech enhancement system, while maintaining a better trade-off between less speech distortion and more effective removal of background interferences present in the noisy mixture.
A Recurrent Variational Autoencoder for Speech Enhancement
A variational expectation-maximization algorithm in which the encoder of the RVAE is fine-tuned at test time to approximate the distribution of the latent variables given the noisy speech observations, which is shown to improve the speech enhancement results.
Speech Denoising with Deep Feature Losses
An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly, which outperforms the state-of-the-art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can effectively suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without generating the annoying musical artifact commonly observed in conventional enhancement methods.
Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features. It extracts a speaker representation used for adaptation directly from the test utterance, uses multi-task learning of speech enhancement and speaker identification, and takes the output of the final hidden layer of the speaker identification branch as an auxiliary feature.
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
A variational inference method iteratively estimates the power spectrogram of the clean speech; the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs.
A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement
The proposed GF-VAE is better than the standard VAE at capturing fine-structured harmonics of speech spectrograms, especially in the high-frequency range, and when these models are used as speech priors for statistical multichannel speech enhancement, the GF-VAE outperforms the VAE and the GF.
Attention Wave-U-Net for Speech Enhancement
The inclusion of the attention mechanism is found to significantly improve the performance of the model in terms of objective speech quality metrics, and the model outperforms all other published speech enhancement approaches on the Voice Bank Corpus (VCTK) dataset.