SEGAN: Speech Enhancement Generative Adversarial Network

Santiago Pascual, Antonio Bonafonte, Joan Serrà
Current speech enhancement techniques operate in the spectral domain and/or exploit some higher-level feature. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm…
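The waveform-level adversarial setup described above can be sketched minimally as follows. The linear generator/discriminator, toy data, and chunk handling below are illustrative stand-ins (SEGAN itself uses deep one-dimensional convolutional networks); the least-squares (LSGAN-style) adversarial losses and the L1 regression term weighted by λ follow the approach described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16384                       # waveform chunk length, as in SEGAN


def generator(noisy, z, w):
    """Toy stand-in generator: enhanced waveform from (noisy, latent z)."""
    return w * noisy + 0.01 * z


def discriminator(x, noisy, v):
    """Toy stand-in discriminator: scalar score of a (candidate, noisy) pair."""
    return float(v @ np.concatenate([x, noisy]) / (2 * T))


clean = rng.standard_normal(T)
noisy = clean + 0.3 * rng.standard_normal(T)   # synthetic noisy mixture
z = rng.standard_normal(T)                     # latent input to the generator
w = 1.0                                        # generator parameter (toy)
v = rng.standard_normal(2 * T)                 # discriminator parameters (toy)

enhanced = generator(noisy, z, w)

# Least-squares adversarial losses: D pushes real pairs toward 1, fakes toward 0.
d_real = discriminator(clean, noisy, v)
d_fake = discriminator(enhanced, noisy, v)
d_loss = 0.5 * (d_real - 1.0) ** 2 + 0.5 * d_fake ** 2

# Generator loss: fool D, plus an L1 term pulling the output toward clean speech.
lam = 100.0
g_loss = 0.5 * (d_fake - 1.0) ** 2 + lam * np.mean(np.abs(enhanced - clean))

print(enhanced.shape, d_loss >= 0.0, g_loss >= 0.0)
```

In training, the two losses would be minimized alternately with respect to the generator and discriminator parameters; the sketch only evaluates them once to show the objective's structure.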


Multi-scale Generative Adversarial Networks for Speech Enhancement
Speech Enhancement Multi-scale Generative Adversarial Networks (SEMGAN), whose generator and discriminator are structured as fully convolutional neural networks (FCNNs), achieve superior performance compared with the optimally modified log-spectral amplitude estimator (OMLSA) and SEGAN under different noisy conditions.
Speech Enhancement via Residual Dense Generative Adversarial Network
Simulations show that the proposed speech enhancement method, which uses a residual dense generative adversarial network to map the log-power spectrum of degraded speech to that of clean speech, outperforms existing GAN-based and masking-based methods in PESQ and other evaluation indexes.
Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training
An adversarial training method that directly boosts the noise robustness of an acoustic model, achieving average relative error rate reductions of 23.38% and 11.54% on the development and test sets, respectively.
Language and Noise Transfer in Speech Enhancement Generative Adversarial Network
This work presents the results of adapting a speech enhancement generative adversarial network by fine-tuning the generator with small amounts of data, and investigates the minimum requirements to obtain stable behavior in terms of several objective metrics in two very different languages.
CP-GAN: Context Pyramid Generative Adversarial Network for Speech Enhancement
This work makes the first attempt to explore global and local speech features for coarse-to-fine speech enhancement, introducing a Context Pyramid Generative Adversarial Network (CP-GAN) that contains a densely connected feature pyramid generator and a dynamic context granularity discriminator to better eliminate audio noise hierarchically.
Data augmentation using generative adversarial networks for robust speech recognition
Speech Enhancement via Generative Adversarial LSTM Networks
  • Yang Xiang, C. Bao
  • Computer Science
    2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)
  • 2018
Experimental results indicate that the proposed novel speech enhancement framework not only improves the quality and intelligibility of noisy speech but is also competitive with other deep learning-based approaches.
A novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework is proposed; it improves the stability of the GAN while producing samples whose distribution is closer to that of natural speech.
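The multi-task objective described above can be sketched as a weighted mix of a conventional acoustic regression loss and an adversarial term. The function name, the mixing weight `alpha`, and the toy inputs below are illustrative assumptions, not details from the paper:

```python
import numpy as np


def mtl_loss(enhanced, clean, d_score_fake, alpha=0.5):
    """Hypothetical MTL mix: acoustic MSE plus a least-squares GAN term."""
    acoustic = np.mean((enhanced - clean) ** 2)   # regression toward clean speech
    adversarial = (d_score_fake - 1.0) ** 2       # push D's score on fakes toward 1
    return alpha * acoustic + (1.0 - alpha) * adversarial


rng = np.random.default_rng(1)
clean = rng.standard_normal(256)
enhanced = clean + 0.1 * rng.standard_normal(256)
loss = mtl_loss(enhanced, clean, d_score_fake=0.4)
print(loss >= 0.0)
```

The acoustic term stabilizes training by anchoring the generator to a regression target, while the adversarial term pushes the output distribution toward natural speech.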


A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective on noisy speech recorded in real-world scenarios, without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.
Speech enhancement based on deep denoising autoencoder
Experimental results show that increasing the depth of the DAE consistently improves performance when a large training set is available, and that, compared with a minimum mean square error-based speech enhancement algorithm, the proposed denoising DAE provides superior performance on the three objective evaluations.
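The denoising-autoencoder idea above can be sketched with a tiny one-hidden-layer network trained to map noisy feature frames back to their clean versions. The dimensions, noise level, and training setup below are illustrative assumptions, far smaller than any real system:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N = 20, 8, 500                      # feature dim, hidden dim, frames
clean = rng.standard_normal((N, D))
noisy = clean + 0.3 * rng.standard_normal((N, D))

W1 = 0.1 * rng.standard_normal((D, H))    # encoder weights
W2 = 0.1 * rng.standard_normal((H, D))    # decoder weights
lr = 0.01
losses = []

for _ in range(200):
    h = np.tanh(noisy @ W1)               # encode the NOISY frame
    recon = h @ W2                        # decode
    err = recon - clean                   # denoising target is the CLEAN frame
    losses.append(float(np.mean(err ** 2)))
    # Plain gradient descent on the squared-error loss.
    gW2 = h.T @ err / N
    gW1 = noisy.T @ ((err @ W2.T) * (1 - h ** 2)) / N
    W1 -= lr * gW1
    W2 -= lr * gW2

print(round(losses[0], 3), round(losses[-1], 3), losses[-1] < losses[0])
```

The key design point is that the input is corrupted but the target is clean, so the network learns a denoising mapping rather than plain reconstruction; stacking such layers gives the deeper DAEs whose benefit the summary reports.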
Improved Techniques for Training GANs
This work focuses on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic, and presents ImageNet samples with unprecedented resolution and shows that the methods enable the model to learn recognizable features of ImageNet classes.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Context Encoders: Feature Learning by Inpainting
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Recurrent Neural Networks for Noise Reduction in Robust ASR
This work introduces a model that uses a deep recurrent autoencoder neural network to denoise input features for robust ASR, demonstrating that the model is competitive with existing feature denoising approaches on the Aurora2 task and outperforms a tandem approach where deep networks are used to predict phoneme posteriors directly.
Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR
It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.
Overview of speech enhancement techniques for automatic speaker recognition
A comparative performance analysis of single-channel, dual-channel and multi-channel speech enhancement techniques, with different types of noise at different SNRs, as a pre-processing stage to an ergodic HMM-based speaker recognizer, is presented.
Image-to-Image Translation with Conditional Adversarial Networks
Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.