ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
@article{Ping2019ClariNetPW,
  title   = {ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech},
  author  = {Wei Ping and Kainan Peng and Jitong Chen},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1807.07281}
}
In this work, we propose a new solution for parallel wave generation by WaveNet. Our method computes the KL divergence in closed form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately…
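The abstract is truncated by the page, but the closed-form KL it refers to is the standard divergence between two univariate Gaussians, as used in ClariNet's Gaussian density distillation. A minimal sketch, with illustrative variable names (`q` for the student distribution, `p` for the teacher's):

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2).

    In ClariNet-style distillation, q is the student's per-sample output
    distribution and p is the teacher's prediction; having this in closed
    form avoids the Monte Carlo KL estimate used in Parallel WaveNet.
    """
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)
```

The divergence is zero exactly when the two Gaussians coincide (e.g. `gaussian_kl(0.0, 1.0, 0.0, 1.0)` returns `0.0`) and grows with any mean or scale mismatch.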
249 Citations
Parallel Neural Text-to-Speech
- Computer Science, ArXiv
- 2019
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow~(IAF) as the parallel neural vocoder.
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
- Computer Science, Interspeech
- 2021
WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence, and through an iterative refinement process, generates an audio waveform.
WaveFlow: A Compact Flow-Based Model for Raw Audio
- Computer Science
- 2019
WaveFlow, a small-footprint generative flow for raw audio, is presented; it is trained with maximum likelihood, without the density distillation and auxiliary losses used in Parallel WaveNet, and provides a unified view of flow-based models for raw audio, including autoregressive flow and bipartite flow as special cases.
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks
- Computer Science, Interspeech
- 2021
A lightweight end-to-end text-to-speech model is proposed that can generate high-quality speech at high speed and jointly trains the prosodic embedding network with the speech waveform generation task using an effective domain-transfer technique.
WaveGlow: A Flow-based Generative Network for Speech Synthesis
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms, implemented using only a single network and trained with a single cost function, maximizing the likelihood of the training data, which makes the training procedure simple and stable.
FastSpeech: Fast, Robust and Controllable Text to Speech
- Computer Science, NeurIPS
- 2019
A novel feed-forward network based on Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
- Computer Science, INTERSPEECH
- 2019
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
Parallel WaveNet conditioned on VAE latent vectors
- Computer Science, ArXiv
- 2020
The use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder with the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model is investigated.
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The proposed Parallel WaveGAN has only 1.44M parameters and can generate a 24 kHz speech waveform 28.68 times faster than real time on a single GPU, which is comparable to the best distillation-based Parallel WaveNet system.
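The multi-resolution spectrogram objective mentioned above compares magnitude spectrograms of generated and reference audio at several STFT resolutions. As a sketch only: the paper's actual loss combines spectral-convergence and log-magnitude terms, while this NumPy version uses a plain L1 magnitude distance, and the FFT sizes and hop lengths are illustrative:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed sliding FFT (no external deps)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multires_stft_loss(x, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average L1 distance between magnitude spectrograms at several resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        X, Y = stft_mag(x, n_fft, hop), stft_mag(y, n_fft, hop)
        total += np.mean(np.abs(X - Y))
    return total / len(resolutions)
```

Comparing at multiple window sizes trades off time and frequency resolution, which discourages the generator from overfitting the artifacts of any single STFT configuration.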
References
Showing 1-10 of 39 references
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
- Computer Science, ICML
- 2018
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous…
Tacotron: Towards End-to-End Speech Synthesis
- Computer Science, INTERSPEECH
- 2017
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
WaveNet: A Generative Model for Raw Audio
- Computer Science, SSW
- 2016
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Deep Voice: Real-time Neural Text-to-Speech
- Computer Science, ICML
- 2017
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Char2Wav: End-to-End Speech Synthesis
- Computer Science, ICLR
- 2017
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Fast Decoding in Sequence Models using Discrete Latent Variables
- Computer Science, ICML
- 2018
A novel method to extend sequence models using discrete latent variables is proposed that makes decoding much more parallelizable and achieves higher scores than previously proposed non-autoregressive translation models on the task of neural machine translation.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
- Computer Science, ICLR
- 2017
It is shown that the model, which benefits from combining memory-less modules (autoregressive multilayer perceptrons) and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
- Computer Science, ICLR
- 2018
A new neural text-to-speech method is presented that is able to transform text to speech in voices that are sampled in the wild, without requiring aligned phonemes or linguistic features, making TTS accessible to a wider range of applications.
Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
- Physics, IEEE Access
- 2018
Experimental results show that acoustic models trained in the WGAN-GP framework using back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.