A Wavenet for Speech Denoising

@article{Rethage2018AWF,
  title={A Wavenet for Speech Denoising},
  author={Dario Rethage and Jordi Pons and Xavier Serra},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={5069-5073}
}
  • D. Rethage, J. Pons, X. Serra
  • Published 22 June 2017
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Most speech processing techniques use magnitude spectrograms as a front-end and therefore discard part of the signal by default: the phase. […] Specifically, the model makes use of non-causal, dilated convolutions and predicts target fields instead of a single target sample. The proposed discriminative adaptation of the model learns in a supervised fashion by minimizing a regression loss. These modifications make the model highly parallelizable during both training and inference. Both…
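
To make the key method concrete, here is a minimal sketch in PyTorch, with invented channel counts and depth (the paper's actual architecture differs): a stack of non-causal dilated 1-D convolutions maps a noisy waveform to a denoised target field and is trained by minimizing a regression loss.

import torch
import torch.nn as nn

class DilatedDenoiser(nn.Module):
    def __init__(self, channels=32, depth=8):
        super().__init__()
        layers = [nn.Conv1d(1, channels, 1)]
        for i in range(depth):
            d = 2 ** i  # dilations 1, 2, 4, ... grow the receptive field
            # Symmetric ("same") padding makes the convolution non-causal:
            # each output sample sees both past and future input samples.
            layers += [nn.Conv1d(channels, channels, 3, dilation=d, padding=d),
                       nn.ReLU()]
        layers.append(nn.Conv1d(channels, 1, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):              # noisy: (batch, 1, samples)
        return self.net(noisy)

model = DilatedDenoiser()
noisy = torch.randn(4, 1, 4096)            # toy noisy waveforms
clean = torch.randn(4, 1, 4096)            # matching clean targets
# Regression loss over the whole predicted target field, not one sample.
loss = nn.functional.l1_loss(model(noisy), clean)
loss.backward()

Because every output sample is produced in one forward pass rather than autoregressively, training and inference both parallelize across the field, which is the speed-up the abstract refers to.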

Citations

Speech Denoising with Deep Feature Losses
TLDR
An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly, which outperforms the state-of-the-art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
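
A rough sketch of the deep-feature-loss idea in PyTorch; the feature_net below is an invented stand-in, whereas the cited paper uses a network pretrained on audio classification tasks:

import torch
import torch.nn as nn

# Invented stand-in for a pretrained audio network; only its frozen
# internal activations are used to define the loss.
feature_net = nn.Sequential(
    nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.ReLU(),
    nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.ReLU(),
).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)

def deep_feature_loss(denoised, clean):
    # Accumulate L1 distances between activations at each layer,
    # instead of a distance between the raw waveforms themselves.
    loss, x, y = 0.0, denoised, clean
    for layer in feature_net:
        x, y = layer(x), layer(y)
        loss = loss + nn.functional.l1_loss(x, y)
    return loss
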
Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that,
Towards end-to-end speech enhancement with a variational U-Net architecture
TLDR
Experiments show that the residual (skip) connections in the proposed system are required for successful end-to-end signal enhancement, i.e., without filter mask estimation, and indicate a slight advantage of the variational U-Net architecture over its non-variational version in terms of signal enhancement performance under reverberant conditions.
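
For illustration, a toy PyTorch sketch of the skip connections the TLDR credits, with all layer sizes invented and the variational bottleneck omitted: encoder features are concatenated onto the upsampled decoder path, so fine temporal detail bypasses the bottleneck.

import torch
import torch.nn as nn

class TinySkipUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.pre  = nn.Conv1d(1, 16, 15, padding=7)             # full-rate features
        self.down = nn.Conv1d(16, 32, 15, stride=4, padding=7)  # 1/4-rate bottleneck
        self.up   = nn.ConvTranspose1d(32, 16, 16, stride=4, padding=6)
        self.out  = nn.Conv1d(16 + 16, 1, 1)   # upsampled + skipped channels

    def forward(self, x):                      # x: (batch, 1, samples)
        s = torch.relu(self.pre(x))            # skip source at input resolution
        b = torch.relu(self.down(s))           # compressed representation
        u = torch.relu(self.up(b))             # back to input resolution
        return self.out(torch.cat([u, s], dim=1))  # the skip connection

denoised = TinySkipUNet()(torch.randn(2, 1, 4096))
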
Speech Enhancement with Variance Constrained Autoencoders
TLDR
This work proposes using the Variance Constrained Autoencoder (VCAE) for speech enhancement and demonstrates experimentally that the proposed enhancement model outperforms SE-GAN and SE-WaveNet in terms of perceptual quality of enhanced signals.
Speech Enhancement using the Wave-U-Net with Spectral Losses
Speech enhancement and source separation are related tasks that aim to extract and/or improve a signal of interest from a recording that may involve sounds from various sources, reverberation, and/or
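
As a sketch of what a spectral loss can look like in this setting (PyTorch; FFT sizes invented, and not necessarily the exact loss combination the paper evaluates), one penalizes STFT-magnitude differences on top of a waveform-domain term:

import torch
import torch.nn.functional as F

def spectral_l1(pred, target, n_fft=512, hop=128):
    # Measure errors in the frequency domain, not just sample by sample.
    win = torch.hann_window(n_fft)
    P = torch.stft(pred, n_fft, hop, window=win, return_complex=True).abs()
    T = torch.stft(target, n_fft, hop, window=win, return_complex=True).abs()
    return F.l1_loss(P, T)

pred, target = torch.randn(4, 16384), torch.randn(4, 16384)
total_loss = F.l1_loss(pred, target) + spectral_l1(pred, target)
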
Residual Recurrent Neural Network for Speech Enhancement
TLDR
This work introduces an end-to-end fully recurrent neural network structured as an hourglass shape that can efficiently capture long-range temporal dependencies by reducing the feature resolution without information loss.
Speech Denoising in the Waveform Domain with Self-Attention
TLDR
CleanUNet is presented, a causal speech denoising model on the raw waveform based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results.
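
A minimal sketch of self-attention applied to bottleneck features (PyTorch; dimensions invented, and the causal masking a streaming denoiser would need is omitted):

import torch
import torch.nn as nn

# Encoder output at the bottleneck: (batch, frames, channels).
bottleneck = torch.randn(4, 64, 128)
attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
# Self-attention lets every frame attend to every other frame, refining
# the bottleneck with long-range context that convolutions alone miss.
refined, _ = attn(bottleneck, bottleneck, bottleneck)
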
A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
TLDR
This work proposes PercepNet, an efficient approach that relies on human perception of speech by focusing on the spectral envelope and on the periodicity of the speech, and demonstrates high-quality, real-time enhancement of fullband speech with less than 5% of a CPU core.
Applications of deep learning to speech enhancement.
TLDR
This work proposes a model that performs speech dereverberation by estimating the spectral magnitude of clean speech from its reverberant counterpart, proposes a method to prune neurons away from the model without impacting performance, and compares this method to others in the literature.
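
As a generic illustration of neuron pruning (PyTorch; the magnitude criterion and the 30% threshold are placeholders, not necessarily what the cited work uses):

import torch
import torch.nn as nn

layer = nn.Linear(512, 512)
# Score each output neuron by the L1 norm of its incoming weights and
# zero out the weakest ones; a real pipeline would then verify that
# enhancement metrics are unaffected, as the TLDR describes.
scores = layer.weight.abs().sum(dim=1)     # one score per output neuron
keep = scores >= scores.quantile(0.30)     # drop the weakest 30%
with torch.no_grad():
    layer.weight[~keep] = 0.0
    layer.bias[~keep] = 0.0
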
...

References

SHOWING 1-10 OF 41 REFERENCES
Speech Enhancement Using Bayesian Wavenet
TLDR
This paper presents a Bayesian speech enhancement framework, called BaWN (Bayesian WaveNet), which directly operates on raw audio samples and adopts the recently announced WaveNet, which is shown to be effective in modeling conditional distributions of speech samples while generating natural speech.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
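
For contrast with the non-causal denoising variant above, a minimal causal dilated convolution in PyTorch (channel counts invented): padding on the left only, so the output at time t never depends on future samples, as autoregressive generation requires.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation    # (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, 3, dilation=dilation)

    def forward(self, x):                # x: (batch, channels, samples)
        # Left-pad only: output length equals input length, and no
        # output sample can look into the future.
        return self.conv(F.pad(x, (self.pad, 0)))

out = CausalDilatedConv(16, dilation=4)(torch.randn(2, 16, 1024))
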
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
TLDR
The proposed DNN approach suppresses highly nonstationary noise, which is generally tough to handle, and is effective on noisy speech recorded in real-world scenarios, without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
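
A bare-bones version of that regression setup in PyTorch (layer widths and the 257-bin feature size are illustrative): a feed-forward network maps noisy log-power spectral frames to clean ones under a mean-squared-error loss.

import torch
import torch.nn as nn

dnn = nn.Sequential(nn.Linear(257, 1024), nn.ReLU(),
                    nn.Linear(1024, 1024), nn.ReLU(),
                    nn.Linear(1024, 257))
noisy_frames = torch.randn(32, 257)   # noisy log-power spectra, one row per frame
clean_frames = torch.randn(32, 257)   # corresponding clean targets
loss = nn.functional.mse_loss(dnn(noisy_frames), clean_frames)
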
Learning Multiscale Features Directly from Waveforms
TLDR
This paper details an approach to use convolutional filters to push past the inherent tradeoff of temporal and frequency resolution that exists for spectral representations and finds more efficient representations by simultaneously learning at multiple scales.
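
A sketch of the multiscale idea in PyTorch (kernel lengths and strides invented): parallel convolutions with short and long kernels trade time resolution against frequency resolution, and their outputs are concatenated.

import torch
import torch.nn as nn

class MultiscaleFrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        # Short kernels resolve fine timing; long kernels resolve fine
        # frequency. A shared stride keeps the output frame rates aligned.
        self.banks = nn.ModuleList(
            nn.Conv1d(1, 32, k, stride=64, padding=k // 2)
            for k in (64, 256, 1024))

    def forward(self, wav):              # wav: (batch, 1, samples)
        return torch.cat([torch.relu(b(wav)) for b in self.banks], dim=1)

features = MultiscaleFrontEnd()(torch.randn(2, 1, 16384))
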
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
TLDR
Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; the results show that the second approach yields larger MCEP distortion but smaller F0 errors.
Speech enhancement based on deep denoising autoencoder
TLDR
Experimental results show that adding depth to the DAE consistently increases performance when a large training data set is given, and that, compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provides superior performance on the three objective evaluations.
SEGAN: Speech Enhancement Generative Adversarial Network
TLDR
This work proposes the use of generative adversarial networks for speech enhancement; the model operates at the waveform level, is trained end-to-end, and incorporates 28 speakers and 40 different noise conditions into a single model whose parameters are shared across them.
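
A toy version of the adversarial setup (PyTorch; both networks drastically simplified, with the least-squares losses SEGAN uses): the discriminator judges (signal, noisy) pairs, pushing the generator toward enhancements indistinguishable from clean speech.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv1d(1, 16, 31, padding=15), nn.ReLU(),
                  nn.Conv1d(16, 1, 31, padding=15))
D = nn.Sequential(nn.Conv1d(2, 16, 31, stride=4, padding=15), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(1))

noisy, clean = torch.randn(4, 1, 4096), torch.randn(4, 1, 4096)
enhanced = G(noisy)
# D is conditioned on the noisy input: real = (clean, noisy) pairs,
# fake = (enhanced, noisy) pairs.
real = D(torch.cat([clean, noisy], dim=1))
fake = D(torch.cat([enhanced.detach(), noisy], dim=1))
d_loss = ((real - 1) ** 2).mean() + (fake ** 2).mean()   # least-squares GAN
g_loss = ((D(torch.cat([enhanced, noisy], dim=1)) - 1) ** 2).mean()
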
End-To-End Source Separation With Adaptive Front-Ends
TLDR
An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the network's ability to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
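
A compact sketch of such an adaptive front-end in PyTorch (window and hop sizes invented): a strided Conv1d plays the analysis transform and a ConvTranspose1d its inverse, with both filterbanks learned from data.

import torch
import torch.nn as nn

analysis  = nn.Conv1d(1, 256, kernel_size=512, stride=128, bias=False)
synthesis = nn.ConvTranspose1d(256, 1, kernel_size=512, stride=128, bias=False)

wav = torch.randn(1, 1, 16384)
codes = torch.relu(analysis(wav))   # nonnegative, spectrogram-like coefficients
recon = synthesis(codes)            # back to the waveform domain
# Training the pair to reconstruct audio yields learned, real-valued
# basis functions in place of a fixed short-time transform.
loss = nn.functional.mse_loss(recon, wav)
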
End-to-end learning for music audio
  • S. Dieleman, B. Schrauwen
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
...