Speech Denoising with Deep Feature Losses

François G. Germain, Qifeng Chen, Vladlen Koltun
We present an end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly. […] The advantage of the new approach is particularly pronounced for the hardest data with the most intrusive background noise, for which denoising is most needed and most challenging.
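The core idea of a deep feature loss is to compare the denoised output and the clean reference through the activations of a frozen, pretrained audio network rather than sample-by-sample. A minimal NumPy sketch of that loss follows; the tiny random convolutional stack here is only a hypothetical stand-in for the pretrained classifier the paper uses, and the names (`feature_loss`, `tiny_feature_net`) are illustrative, not from the paper:

```python
import numpy as np

def tiny_feature_net(x, kernels):
    """Hypothetical stand-in for a frozen, pretrained audio classifier:
    a stack of 1-D convolutions with ReLU; returns every layer's activations."""
    feats = []
    for k in kernels:
        x = np.maximum(np.convolve(x, k, mode="valid"), 0.0)  # conv + ReLU
        feats.append(x)
    return feats

def feature_loss(denoised, clean, feature_net):
    """Deep feature loss: L1 distance between activations of the frozen
    feature network, accumulated over all layers."""
    loss = 0.0
    for f_d, f_c in zip(feature_net(denoised), feature_net(clean)):
        loss += np.mean(np.abs(f_d - f_c))
    return loss

rng = np.random.default_rng(0)
kernels = [rng.standard_normal(9) for _ in range(3)]  # fixed ("frozen") weights
clean = rng.standard_normal(1024)
noisy = clean + 0.1 * rng.standard_normal(1024)

net = lambda x: tiny_feature_net(x, kernels)
print(feature_loss(clean, clean, net))  # 0.0: identical inputs give zero loss
print(feature_loss(noisy, clean, net))  # > 0: noise perturbs the deep features
```

In training, this scalar would be minimized with respect to the denoising network's parameters while the feature network stays fixed, so the denoiser is pushed to match the clean signal's perceptually relevant features rather than its exact samples.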


Multi-objective noisy-based deep feature loss for speech enhancement
  • Rafal Pilarczyk, W. Skarbek
  • Computer Science
    Symposium on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments (WILGA)
  • 2019
This work shows that using only deep features in the loss function yields a significant improvement in measured speech signal quality, and the authors suggest that a deep-feature loss could help optimize objectives that are difficult to differentiate.
Audio Denoising with Deep Network Priors
A method for audio denoising that combines processing done in both the time domain and the time-frequency domain, and only trains on the specific audio clip that is being denoised.
Deep Network Perceptual Losses for Speech Denoising
This work first trained deep neural networks to classify either spoken words or environmental sounds from audio, then trained an audio transform to map noisy speech to an audio waveform that minimized 'perceptual' losses derived from the recognition network.
Improving deep speech denoising by Noisy2Noisy signal mapping
Speech Enhancement Using Deep Learning Methods: A Review
The trend in deep learning architectures has shifted from the standard deep neural network to the convolutional neural network (CNN), which can efficiently learn temporal information in the speech signal, and the generative adversarial network (GAN), which trains two networks adversarially.
Speech Denoising with Auditory Models
The results show that deep features can guide speech enhancement, but suggest that they do not yet outperform simple alternatives that do not involve learned features.
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models
A generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses, is introduced, along with the critical observation that state-of-the-art multi-task weight-learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of the losses.
Speech Enhancement using the Wave-U-Net with Spectral Losses
Speech enhancement and source separation are related tasks that aim to extract and/or improve a signal of interest from a recording that may involve sounds from various sources, reverberation, and/or noise.
Speech Denoising with Residual Attention U-Net
The residual attention U-Net is proposed, which connects the same layer of multiple stacked residual channel attention encoder/decoder models for speech denoising to remove background noises from noisy, monaural speech signals by directly processing a raw waveform.
Deep speech inpainting of time-frequency masks
An end-to-end framework for speech inpainting, the context-based retrieval of missing or severely distorted parts of time-frequency representation of speech, based on a convolutional U-Net trained via deep feature losses obtained using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word classification task.


Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can effectively suppress highly nonstationary noise, which is generally difficult to handle, and deals well with noisy speech recorded in real-world scenarios without producing the annoying musical artifacts common in conventional enhancement methods.
A deep neural network for time-domain signal reconstruction
  • Yuxuan Wang, Deliang Wang
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
A new deep network is proposed that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer and significantly outperforms a recent non-negative matrix factorization based separation system in both objective speech intelligibility and quality.
Speech enhancement based on deep denoising autoencoder
Experimental results show that increasing the depth of the DAE consistently improves performance when a large training data set is given; compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provides superior performance on the three objective evaluations.
A Wavenet for Speech Denoising
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time-complexity by eliminating its autoregressive nature.
Raw waveform-based speech enhancement by fully convolutional networks
The proposed fully convolutional network (FCN) model can not only effectively recover the waveforms but also outperform the LPS-based DNN baseline in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
Speech Enhancement Using Bayesian Wavenet
This paper presents a Bayesian speech enhancement framework, called BaWN (Bayesian WaveNet), which directly operates on raw audio samples and adopts the recently announced WaveNet, which is shown to be effective in modeling conditional distributions of speech samples while generating natural speech.
Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks
This paper deals with improving speech quality in office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech and proposes several strategies based on Deep Neural Networks for speech enhancement in these scenarios.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; it incorporates 28 speakers and 40 different noise conditions into the same model, so that model parameters are shared across them.
A Deep Ensemble Learning Method for Monaural Speech Separation
A deep ensemble method, named multicontext networks, is proposed to address monaural speech separation and it is found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.