• Corpus ID: 49882757

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

@article{Ping2019ClariNetPW,
  title={ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech},
  author={Wei Ping and Kainan Peng and Jitong Chen},
  journal={ArXiv},
  year={2019},
  volume={abs/1807.07281}
}
In this work, we propose a new solution for parallel wave generation by WaveNet. Our method computes the KL divergence in closed form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately…
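The closed-form KL divergence mentioned in the abstract applies when both the teacher's and the student's per-sample output distributions are Gaussian, so no Monte Carlo sampling is needed during distillation. A minimal sketch of that closed form for univariate Gaussians (the function name is illustrative, not from the paper's code):

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(P || Q) for univariate Gaussians
    P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

# KL is zero iff the two distributions coincide, and grows as they diverge.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0
```

Because this expression is differentiable in the student's parameters, it can be minimized directly by gradient descent, which is what makes the distillation efficient.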


Parallel Neural Text-to-Speech
TLDR
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder.
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
TLDR
WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence, and through an iterative refinement process, generates an audio waveform.
WaveFlow: A Compact Flow-Based Model for Raw Audio
  • Computer Science
  • 2019
TLDR
WaveFlow, a small-footprint generative flow for raw audio, is presented, which is trained with maximum likelihood without density distillation and auxiliary losses as used in Parallel WaveNet, and provides a unified view of flow-based models for raw audio, including autoregressive flow and bipartite flow as special cases.
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks
TLDR
A lightweight end-to-end text-to-speech model that can generate high-quality speech at breakneck speed and jointly trains the prosodic embedding network with the speech waveform generation task using an effective domain transfer technique is proposed.
WaveGlow: A Flow-Based Generative Network for Speech Synthesis
TLDR
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS is proposed, which speeds up mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x and is called FastSpeech.
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
Parallel WaveNet conditioned on VAE latent vectors
TLDR
This work investigates improving the signal quality of a Parallel WaveNet neural vocoder by conditioning it on a sentence-level latent vector taken from the pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model.
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
TLDR
The proposed Parallel WaveGAN has only 1.44M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment, which is comparable to the best distillation-based Parallel WaveNet system.
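The multi-resolution spectrogram objective used by this family of vocoders compares generated and reference waveforms in the STFT magnitude domain at several window/hop sizes, so that both fine temporal and fine spectral structure are penalized. A simplified L1-only sketch (Parallel WaveGAN's actual loss also includes a spectral-convergence term; function names here are illustrative):

```python
import numpy as np

def stft_mag(x, frame, hop):
    """Magnitude STFT via Hann-windowed framed real FFT."""
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_res_spectral_loss(y, y_hat,
                            resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average L1 distance between magnitude spectrograms
    computed at several (frame, hop) STFT resolutions."""
    losses = []
    for frame, hop in resolutions:
        S, S_hat = stft_mag(y, frame, hop), stft_mag(y_hat, frame, hop)
        losses.append(np.mean(np.abs(S - S_hat)))
    return float(np.mean(losses))
```

Averaging over several resolutions avoids overfitting the generator to any single time-frequency trade-off of the analysis window.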

References

Showing 1-10 of 39 references
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters that achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Fast Decoding in Sequence Models using Discrete Latent Variables
TLDR
A novel method to extend sequence models using discrete latent variables that makes decoding much more parallelizable and achieves higher scores than previously proposed non-autoregressive translation models on the task of neural machine translation.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
TLDR
A new neural text-to-speech method that is able to transform text to speech in voices that are sampled in the wild and without requiring aligned phonemes or linguistic features is presented, making TTS accessible to a wider range of applications.
Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
TLDR
Experimental results show that acoustic models trained using the WGAN-GP framework with back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.