Publications
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
TLDR
The first text-to-wave neural architecture for speech synthesis is introduced; it is fully convolutional, enables fast end-to-end training from scratch, and significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet.
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high-quality audio synthesis and preserving speaker identities almost perfectly.
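As a rough illustration of the multi-speaker idea, the sketch below (PyTorch; the module name, shapes, and speaker count are illustrative assumptions, not the paper's exact architecture) conditions a convolutional layer on a learned per-speaker embedding, so a single network can model many voices:

import torch
import torch.nn as nn

class SpeakerConditionedConv(nn.Module):
    def __init__(self, channels, num_speakers, embed_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        # One learned embedding per speaker identity.
        self.speaker_embed = nn.Embedding(num_speakers, embed_dim)
        # Project the embedding into a per-channel bias for this layer.
        self.proj = nn.Linear(embed_dim, channels)

    def forward(self, x, speaker_id):
        # x: (batch, channels, time); speaker_id: (batch,) of speaker indices
        bias = self.proj(self.speaker_embed(speaker_id))  # (batch, channels)
        return torch.relu(self.conv(x) + bias.unsqueeze(-1))

layer = SpeakerConditionedConv(channels=64, num_speakers=108)
out = layer(torch.randn(2, 64, 100), torch.tensor([3, 42]))  # two speakers in one batch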
Neural Voice Cloning with a Few Samples
TLDR
While speaker adaptation can achieve better naturalness and similarity, the speaker encoding approach requires significantly less cloning time and memory, making it favorable for low-resource deployment.
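The trade-off between the two cloning strategies can be sketched as follows (Python; tts_model, speaker_encoder, and the loss method are hypothetical stand-ins, not the paper's API): adaptation fine-tunes model weights per speaker, while encoding predicts a small embedding directly from the cloning audio.

import torch

def clone_by_adaptation(tts_model, samples, steps=100):
    # Speaker adaptation: fine-tune the TTS weights on the few cloning
    # samples. Tends to give better naturalness and similarity, but each
    # new speaker costs gradient steps and a full copy of adapted weights.
    opt = torch.optim.Adam(tts_model.parameters(), lr=1e-4)
    for _ in range(steps):
        for text, audio in samples:
            loss = tts_model.loss(text, audio)  # hypothetical loss method
            opt.zero_grad()
            loss.backward()
            opt.step()
    return tts_model

def clone_by_encoding(speaker_encoder, samples):
    # Speaker encoding: a separately trained encoder maps the cloning audio
    # directly to a fixed-size speaker embedding. No gradient steps at
    # cloning time; only a small vector is stored per new speaker.
    with torch.no_grad():
        embeds = [speaker_encoder(audio) for _, audio in samples]
    return torch.stack(embeds).mean(dim=0)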
Non-Autoregressive Neural Text-to-Speech
TLDR
ParaNet, a fully convolutional, non-autoregressive seq2seq model that converts text to spectrogram, is proposed; it brings a 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis while obtaining reasonably good speech quality.
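A minimal sketch of the non-autoregressive idea, in the spirit of ParaNet (PyTorch; module names, layer counts, and shapes are illustrative assumptions, not the paper's exact architecture): every spectrogram frame is predicted in parallel from the text encoding, with the text-to-frame attention refined layer by layer rather than frame by frame.

import torch
import torch.nn as nn

class ParallelDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80, n_layers=4, max_frames=1000):
        super().__init__()
        # Learned positional queries, one per output frame.
        self.query = nn.Parameter(torch.randn(1, max_frames, d_model))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, 4, batch_first=True) for _ in range(n_layers))
        self.conv = nn.ModuleList(
            nn.Conv1d(d_model, d_model, 5, padding=2) for _ in range(n_layers))
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, text_enc, n_frames):
        # text_enc: (batch, text_len, d_model)
        h = self.query[:, :n_frames].expand(text_enc.size(0), -1, -1)
        for attn, conv in zip(self.attn, self.conv):
            # Refine the text-to-frame attention layer by layer, then mix
            # locally with a convolution. All frames are computed in parallel.
            ctx, _ = attn(h, text_enc, text_enc)
            h = torch.relu(conv((h + ctx).transpose(1, 2))).transpose(1, 2)
        return self.out(h)  # (batch, n_frames, n_mels)

dec = ParallelDecoder()
mel = dec(torch.randn(2, 50, 256), n_frames=200)  # all 200 frames at once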
WaveFlow: A Compact Flow-based Model for Raw Audio
TLDR
WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
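The unified view can be made concrete with a small worked example of the squeeze operation (NumPy; a toy illustration, not the full model): a length-T waveform is reshaped into an h-by-(T/h) matrix, and the model is autoregressive only over the h rows. Then h = 1 recovers a fully parallel WaveGlow-like flow, h = T a fully sequential WaveNet-like factorization, and intermediate h needs only h sequential synthesis steps.

import numpy as np

def squeeze(waveform, h):
    # Reshape (T,) -> (h, T // h), so that adjacent samples in time land
    # in adjacent rows of the same column.
    T = len(waveform)
    assert T % h == 0
    return waveform.reshape(T // h, h).T

x = np.arange(16)      # toy "waveform" of length T = 16
print(squeeze(x, 1))   # 1 row  -> fully parallel (WaveGlow-like case)
print(squeeze(x, 8))   # 8 rows -> 8 sequential steps
print(squeeze(x, 16))  # T rows -> fully autoregressive (WaveNet-like case)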
Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework
TLDR
This work proposes a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation, which achieves speech naturalness similar to full-sentence TTS but with only a constant latency of 1-2 words.
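A minimal sketch of the prefix-to-prefix (lookahead-k) policy (Python; synthesize_word is a hypothetical stand-in for the acoustic model and vocoder): audio for word i is emitted as soon as words up to i+k have arrived, so latency stays constant at k words instead of growing with sentence length.

def synthesize_word(context, index):
    # Hypothetical stand-in: a real system would run the acoustic model and
    # vocoder on the word at `index`, conditioned on the available prefix.
    return "<audio:{}>".format(context[index])

def incremental_tts(words, k=2):
    # Prefix-to-prefix policy: word i is synthesized once words up to i+k
    # have arrived, giving a constant k-word latency.
    buffer, emitted = [], 0
    for w in words:
        buffer.append(w)
        while emitted + k < len(buffer):
            yield synthesize_word(buffer, emitted)
            emitted += 1
    while emitted < len(buffer):  # flush the final k words at sentence end
        yield synthesize_word(buffer, emitted)
        emitted += 1

audio = list(incremental_tts("this is an incremental sentence".split(), k=2))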
Parallel Neural Text-to-Speech
TLDR
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder.