WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence and, through an iterative refinement process, generates an audio waveform. This contrasts with the original WaveGrad vocoder, which conditions on mel-spectrogram features generated by a separate model. The iterative refinement process…
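The idea of generating a signal by repeatedly following an estimated gradient of a log density can be illustrated with a toy sketch. The example below is not the WaveGrad 2 sampler: it assumes a hypothetical closed-form score (`toy_score`, the gradient of a Gaussian log density) in place of a trained network, uses a plain Langevin-style update as a stand-in for the paper's refinement schedule, and lets a fixed sine wave (`target`) play the role of the conditioning signal. All names and step sizes are illustrative assumptions.

```python
import numpy as np


def toy_score(y, mean, std):
    """Gradient of log N(y; mean, std^2) with respect to y.

    Hypothetical stand-in for a learned score network; `mean` plays the
    role of the conditioning input.
    """
    return -(y - mean) / std**2


def iterative_refinement(mean, std, n_steps=500, step_size=0.001, seed=0):
    """Langevin-style refinement: start from noise, follow the score."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(mean.shape)  # initial estimate is pure noise
    for _ in range(n_steps):
        noise = rng.standard_normal(mean.shape)
        # Move along the score, then add a small amount of fresh noise.
        y = y + step_size * toy_score(y, mean, std) + np.sqrt(2.0 * step_size) * noise
    return y


# Toy "waveform" that the conditional density is centred on (assumption).
target = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
sample = iterative_refinement(target, std=0.1)
```

After enough steps the sample is approximately a draw from the toy Gaussian, so it tracks `target` up to noise on the order of `std`. In a real model the score would come from a trained network and the step schedule would be tuned.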
2 Citations

On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis
These findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody.
ESPnet2-TTS: Extending the Edge of TTS Research
  • Tomoki Hayashi, Ryuichi Yamamoto, +7 authors Shinji Watanabe
  • Computer Science, Engineering
  • 2021
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible…


WaveGlow: A Flow-based Generative Network for Speech Synthesis
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
FloWaveNet : A Generative Flow for Raw Audio
FloWaveNet is proposed, a flow-based generative model for raw audio synthesis that requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
The first text-to-wave neural architecture for speech synthesis is introduced; it is fully convolutional, enables fast end-to-end training from scratch, and significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.
DurIAN: Duration Informed Attention Network For Multimodal Synthesis
It is shown that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while at the same time avoiding the word skipping/repeating errors in those systems.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2, and achieves state-of-the-art performance and close to human quality.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.