Parametric Resynthesis With Neural Vocoders

  • Soumi Maiti, Michael I. Mandel
  • Published 16 June 2019
  • Computer Science
  • 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogram. Both WaveNet and WaveGlow achieve better subjective and objective quality scores than the… 
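The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mel analysis is a textbook triangular filterbank, and `predict_clean_mel` is a hypothetical stand-in (simple spectral-floor subtraction) for the trained prediction network; the neural-vocoder synthesis step is omitted.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude STFT via Hann-windowed frames."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))   # (T, n_fft//2+1)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular mel filters (standard textbook construction)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def predict_clean_mel(noisy_mel):
    # Hypothetical stand-in for the paper's prediction network:
    # estimate noise from the first frames and subtract it.
    noise = noisy_mel[:5].mean(axis=0, keepdims=True)
    return np.maximum(noisy_mel - noise, 0.0)

sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(len(t))

fb = mel_filterbank()
noisy_mel = stft_mag(noisy) @ fb.T        # (T, 40) mel-spectrogram features
clean_mel_hat = predict_clean_mel(noisy_mel)
# A neural vocoder (WaveNet or WaveGlow) would now synthesize a clean
# waveform from clean_mel_hat; that synthesis step is omitted here.
```

The point of the design is that the enhancement network only has to predict compact spectral features; all waveform detail is delegated to the vocoder.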

Figures and Tables from this paper

Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement
  • Soumi Maiti, Michael I. Mandel
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This work shows that when trained on data from enough speakers, neural vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training, and shows that objective signal and overall quality are higher than those of the state-of-the-art speech enhancement systems Wave-U-Net, Wavenet-denoise, and SEGAN.
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Experimental results showed that SpecGrad generates higher-fidelity speech waveform than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Comparative Study on Neural Vocoders for Multispeaker Text-To-Speech Synthesis
The study uses subjective and objective analysis to compare the performance of two vocoders, showing that the WaveNet vocoder outperforms WaveGlow in voice quality for multispeaker text-to-speech synthesis.
Universal Speech Enhancement with Score-based Diffusion
This work proposes to consider the task of speech enhancement as a holistic endeavor, and presents a universal speech enhancement system that tackles 55 different distortions at the same time, using a generative model that employs score-based diffusion and a multi-resolution conditioning network that performs enhancement with mixture density networks.
Articulatory-WaveNet: Autoregressive Model For Acoustic-to-Articulatory Inversion
The proposed Articulatory-WaveNet system uses the WaveNet speech synthesis architecture, with dilated causal convolutional layers using previous values of the predicted articulatory trajectories conditioned on acoustic features, to solve the problem of acoustic-to-articulatory inversion.
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning
This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.
Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet
This paper introduces the first application of a WaveNet synthesis approach to the problem of Acoustic-to-Articulatory Inversion, and results are comparable to or better than the best currently published systems.
Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks
It is found that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.
Deep Griffin–Lim Iteration: Trainable Iterative Phase Reconstruction Using Neural Network
DeGLI significantly improved both objective and subjective measures over GLA by incorporating the DNN, and its sound quality was comparable to that of neural vocoders.
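The GLA baseline that DeGLI improves on is the classic Griffin–Lim algorithm: recover phase from magnitudes by alternating between time-domain consistency and the target magnitudes. A minimal numpy sketch (illustrative parameters; hand-rolled STFT/ISTFT, not a production implementation):

```python
import numpy as np

n_fft, hop = 256, 64
win = np.hanning(n_fft)

def stft(x):
    return np.stack([np.fft.rfft(x[i:i + n_fft] * win)
                     for i in range(0, len(x) - n_fft + 1, hop)])

def istft(S, length):
    """Overlap-add inverse STFT with window-power normalization."""
    x, norm = np.zeros(length), np.zeros(length)
    for t, i in enumerate(range(0, length - n_fft + 1, hop)):
        x[i:i + n_fft] += np.fft.irfft(S[t], n=n_fft) * win
        norm[i:i + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

sr = 8000
sig = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
mag = np.abs(stft(sig))                       # target magnitudes, phase discarded

phase = np.exp(2j * np.pi * np.random.default_rng(0).random(mag.shape))
err_init = np.abs(np.abs(stft(istft(mag * phase, len(sig)))) - mag).mean()

for _ in range(50):                           # alternating projections
    x = istft(mag * phase, len(sig))          # project onto consistent signals
    phase = np.exp(1j * np.angle(stft(x)))    # keep phase, restore magnitudes

err_final = np.abs(np.abs(stft(x)) - mag).mean()
```

DeGLI's contribution is to interleave a trained DNN with these projections so fewer iterations are needed.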


Concatenative Resynthesis Using Twin Networks
This work proposes here learning a similarity metric using two separate networks, one network processing the clean segments offline and another processing the noisy segments at run time, which incorporates a ranking loss to optimize for the retrieval of appropriate clean speech segments.
Speech Denoising by Parametric Resynthesis
  • Soumi Maiti, Michael I. Mandel
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This work proposes the use of clean speech vocoder parameters as the target for a neural network performing speech enhancement, producing a model that equals the oracle Wiener mask in subjective quality and intelligibility and surpasses a realistic system.
Concatenative Resynthesis with Improved Training Signals for Speech Enhancement
More robust mappings can be learned with a more efficient use of the available data by selecting pairings that are not exact matches but contain similar clean speech, matching the original in terms of acoustic, phonetic, and prosodic content.
Large Vocabulary Concatenative Resynthesis
This paper generalizes the previous small-vocabulary system to large vocabulary, employing efficient decoding techniques based on fast approximate nearest neighbor (ANN) algorithms to construct a large-vocabulary concatenative resynthesis system.
Waveglow: A Flow-based Generative Network for Speech Synthesis
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
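The "single cost function: maximizing the likelihood" mentioned above rests on the change-of-variables formula for invertible maps. A minimal one-dimensional affine-flow sketch (illustrative only; WaveGlow stacks many invertible convolutional layers, not this single closed-form map):

```python
import numpy as np

# Change of variables: if z = f(x) with f invertible, then
#   log p_x(x) = log p_z(f(x)) + log |df/dx|.
# Here f is a single affine map z = (x - mu) / sigma, whose maximum-likelihood
# parameters under a standard-normal prior on z are available in closed form.

rng = np.random.default_rng(1)
x = 3.0 + 2.0 * rng.standard_normal(10000)    # data drawn from N(3, 2^2)

mu, sigma = x.mean(), x.std()                 # ML estimates for the affine flow
z = (x - mu) / sigma                          # forward pass: data -> latent
log_det = -np.log(sigma)                      # log |dz/dx|, same for every sample
log_px = -0.5 * (z ** 2 + np.log(2 * np.pi)) + log_det   # per-sample log-likelihood
```

Because every layer is invertible, sampling is just the reverse pass (draw z, apply the inverse map), which is what makes flow vocoders fast at synthesis time.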
Speaker-Dependent WaveNet Vocoder
A speaker-dependent WaveNet vocoder is proposed, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as auxiliary features of WaveNet.
A Wavenet for Speech Denoising
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time-complexity by eliminating its autoregressive nature.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous…
Improved Speech Enhancement with the Wave-U-Net
The Wave-U-Net architecture, a model introduced by Stoller et al. for the separation of music vocals and accompaniment, is studied, finding that a reduced number of hidden layers is sufficient for speech enhancement compared to the original system designed for singing-voice separation in music.