Parametric Resynthesis With Neural Vocoders
@article{Maiti2019ParametricRW,
  title   = {Parametric Resynthesis With Neural Vocoders},
  author  = {Soumi Maiti and Michael I. Mandel},
  journal = {2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year    = {2019},
  pages   = {303-307}
}
Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogram. Both WaveNet and WaveGlow achieve better subjective and objective quality scores than the…
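The pipeline described above has two stages: predict a clean mel spectrogram from noisy speech, then synthesize a waveform from it. A full neural vocoder is beyond a short sketch, so the snippet below illustrates only the synthesis stage with classical Griffin-Lim phase reconstruction standing in for WaveNet/WaveGlow, implemented from scratch in numpy. All function names here are illustrative, not from the paper; in the actual system, the enhancement network's predicted mel spectrogram (mapped back to linear magnitude) would take the place of `mag`.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Frame the signal with a Hann window and take the FFT of each frame.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, n_fft=512, hop=128):
    # Overlap-add inverse STFT with window-sum normalization.
    win = np.hanning(n_fft)
    n = hop * (S.shape[0] - 1) + n_fft
    x = np.zeros(n)
    wsum = np.zeros(n)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        x[i * hop:i * hop + n_fft] += frame * win
        wsum[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(wsum, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    # Start from random phase and iterate toward a phase consistent
    # with the target magnitude spectrogram.
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        S = stft(x, n_fft, hop)
        phase = S / np.maximum(np.abs(S), 1e-8)
    return istft(mag * phase, n_fft, hop)
```

The neural vocoders compared in the paper replace this iterative phase estimation with a learned generative model conditioned directly on the (predicted) mel spectrogram, which is what yields the higher subjective quality.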
9 Citations
Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work shows that, when trained on data from enough speakers, neural vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. It also shows that objective signal and overall quality are higher than those of the state-of-the-art speech enhancement systems Wave-U-Net, Wavenet-denoise, and SEGAN.
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
- ArXiv
- 2022
Experimental results showed that SpecGrad generates higher-fidelity speech waveform than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Comparative Study on Neural Vocoders for Multispeaker Text-To-Speech Synthesis
- 2020 IEEE Recent Advances in Intelligent Computational Systems (RAICS)
- 2020
The study uses subjective and objective analysis to compare the performance of the two vocoders, and shows that WaveNet outperforms WaveGlow in voice-quality analysis for multispeaker text-to-speech synthesis.
Universal Speech Enhancement with Score-based Diffusion
- ArXiv
- 2022
This work proposes to consider the task of speech enhancement as a holistic endeavor, and presents a universal speech enhancement system that tackles 55 different distortions at the same time, using a generative model that employs score-based diffusion and a multi-resolution conditioning network that performs enhancement with mixture density networks.
Articulatory-WaveNet: Autoregressive Model For Acoustic-to-Articulatory Inversion
- ArXiv
- 2020
The proposed Articulatory-WaveNet system uses the WaveNet speech synthesis architecture, with dilated causal convolutional layers conditioned on acoustic features and on previous values of the predicted articulatory trajectories, to solve the problem of acoustic-to-articulatory inversion.
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.
Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet
- INTERSPEECH
- 2020
This paper introduces the first application of a WaveNet synthesis approach to the problem of Acoustic-to-Articulatory Inversion, and results are comparable to or better than the best currently published systems.
Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks
- 2021
It is found that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.
Deep Griffin–Lim Iteration: Trainable Iterative Phase Reconstruction Using Neural Network
- IEEE Journal of Selected Topics in Signal Processing
- 2021
DeGLI significantly improved both objective and subjective measures from GLA by incorporating the DNN, and its sound quality was comparable to those of neural vocoders.
References
Concatenative Resynthesis Using Twin Networks
- INTERSPEECH
- 2017
This work proposes here learning a similarity metric using two separate networks, one network processing the clean segments offline and another processing the noisy segments at run time, which incorporates a ranking loss to optimize for the retrieval of appropriate clean speech segments.
Speech Denoising by Parametric Resynthesis
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work proposes the use of clean speech vocoder parameters as the target for a neural network performing speech enhancement and produces a model that equals the oracle Wiener mask in subjective quality and intelligibility and surpasses the realistic system.
Concatenative Resynthesis with Improved Training Signals for Speech Enhancement
- INTERSPEECH
- 2018
More robust mappings can be learned with a more efficient use of the available data by selecting pairings that are not exact matches, but contain similar clean speech that matches the original in terms of acoustic, phonetic, and prosodic content.
Large Vocabulary Concatenative Resynthesis
- INTERSPEECH
- 2018
This paper generalizes the previous small-vocabulary system to large vocabulary by employing efficient decoding techniques based on fast approximate nearest neighbor (ANN) algorithms to construct a large-vocabulary concatenative resynthesis system.
Waveglow: A Flow-based Generative Network for Speech Synthesis
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
Speaker-Dependent WaveNet Vocoder
- INTERSPEECH
- 2017
A speaker-dependent WaveNet vocoder is proposed: a method of synthesizing speech waveforms with WaveNet by using acoustic features from an existing vocoder as auxiliary features of WaveNet.
A Wavenet for Speech Denoising
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time-complexity by eliminating its autoregressive nature.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
- ICML
- 2018
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous…
Improved Speech Enhancement with the Wave-U-Net
- ArXiv
- 2018
The Wave-U-Net architecture, a model introduced by Stoller et al. for the separation of music vocals and accompaniment, is studied, finding that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing-voice separation in music.