Corpus ID: 3469827

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

@article{Ping2018DeepV3,
  title={Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning},
  author={Wei Ping and Kainan Peng and Andrew Gibiansky and Sercan {\"O}. Arik and Ajay Kannan and Sharan Narang and Jonathan Raiman and John Miller},
  journal={arXiv: Sound},
  year={2018}
}
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
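The attention-error mitigation the abstract refers to amounts to forcing the text-to-audio alignment to advance monotonically at inference time. A minimal numpy sketch of that kind of windowed monotonic attention constraint follows; the function names, window width, and toy decode loop are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def constrained_attention(scores, prev_peak, window=3):
    # Mask raw attention scores so the decoder may only attend to a
    # small window at or just ahead of the last attended position.
    masked = np.full_like(scores, -np.inf)
    masked[prev_peak:prev_peak + window] = scores[prev_peak:prev_peak + window]
    return softmax(masked)

# Toy decode loop: 5 decoder steps over 10 encoder positions.
rng = np.random.default_rng(0)
peak = 0
for step in range(5):
    scores = rng.standard_normal(10)           # stand-in for query-key scores
    attn = constrained_attention(scores, peak)
    peak = int(attn.argmax())                  # the alignment can only move forward
```

Because the softmax only ever sees scores inside the window, the alignment peak can neither jump backward (repeated words) nor leap far ahead (skipped words), which are the common attention error modes in TTS.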

Citations

Deep Text-to-Speech System with Seq2Seq Model
TLDR
It is shown that the proposed model achieves attention alignment much faster than previous architectures and that good audio quality can be achieved with a much smaller model.
Parallel Neural Text-to-Speech
TLDR
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder.
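The parallelism claimed in the entry above is easy to see in miniature: an affine inverse autoregressive flow computes x_t = z_t * s_t + m_t, where the shift m_t and scale s_t depend only on the already-known latents z_{<t}, so once all of z is drawn, every output sample can be produced in one parallel pass. A minimal numpy sketch under that assumption; causal_stats is a toy stand-in for the real autoregressive conditioning network:

```python
import numpy as np

def causal_stats(z):
    # Toy conditioning network: the shift and scale for step t are
    # functions of a causal running mean of z[:t], never of z[t] itself.
    prev_mean = np.concatenate([[0.0], np.cumsum(z)[:-1] / np.arange(1, len(z))])
    shift = 0.1 * prev_mean
    scale = np.exp(-0.5 * np.tanh(prev_mean))  # strictly positive scales
    return shift, scale

def iaf_layer(z):
    # One affine IAF layer, applied to the whole sequence at once.
    shift, scale = causal_stats(z)
    return z * scale + shift

rng = np.random.default_rng(0)
z = rng.standard_normal(16000)     # one second of 16 kHz latent noise
x = iaf_layer(iaf_layer(z))        # stacking layers builds the vocoder flow
print(x.shape)
```

A plain autoregressive vocoder must generate sample t before it can condition on it; the IAF direction inverts that dependency, trading fully parallel synthesis for sequential density evaluation at training time.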
Reformer-TTS: Neural Speech Synthesis with Reformer Network
TLDR
This work proposes Reformer-TTS, a model using a Reformer network with locality-sensitive hashing attention and reversible residual networks, which leads to fast convergence when training an end-to-end TTS system.
Attention, I'm Trying to Speak: CS 224n Project on Speech Synthesis
We implement an end-to-end parametric text-to-speech synthesis model that produces audio from a sequence of input characters, and demonstrate that it is possible to build a convolutional sequence-to-sequence model for this task.
Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition
TLDR
This work extends the speech synthesizer so that it can output the speech of many speakers, and demonstrates that the A2W model trained with data from the multi-speaker synthesizer achieves a significant improvement over both the baseline and the single-speaker model.
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
TLDR
This paper explores how the current speech synthesis technology can be leveraged to tailor the ASR system for a target domain by preparing only a relevant text corpus and generates speech features using a sequence-to-sequence speech synthesizer.
End-to-End Text-to-Speech Synthesis: CS 229 Project Report, Autumn 2018
TLDR
This project applied word/phoneme mapping, signal filtering, and machine learning techniques (support vector regression, a simple neural network, and a seq2seq model with attention) to transform text to speech, and found that the synthesis system could successfully generate a WAV file from a single input text.
Fully Convolutional Sequence-to-Sequence Voice Conversion
TLDR
This paper proposes a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning that allows the flexible conversion of not only the voice characteristics but also the pitch contour and duration of the input speech.
MAKEDONKA: Applied Deep Learning Model for Text-to-Speech Synthesis in Macedonian Language
TLDR
MAKEDONKA, the first open-source Macedonian-language synthesizer based on a deep learning approach, is presented; it uses a fully-convolutional sequence-to-sequence acoustic model with a position-augmented attention mechanism, Deep Voice 3.
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
TLDR
This paper connects the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task, and finds that the representation must satisfy several important properties to serve as drop-in replacements for text.

References

Showing 1-10 of 29 references
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high-quality audio synthesis and preserving speaker identities almost perfectly.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Voice Synthesis for in-the-Wild Speakers via a Phonological Loop
TLDR
A new neural text-to-speech method is presented that can transform text to speech in voices sampled in the wild and deal with unconstrained samples obtained from public speeches.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text, using a bidirectional recurrent neural network with attention to produce vocoder acoustic features.
Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System
TLDR
Apple's hybrid unit-selection speech synthesis system, which provides the voices for Siri under requirements of naturalness, personality, and expressivity, is described, along with techniques that enable on-device operation such as preselection optimization, caching for low latency, and unit pruning for a low footprint.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
TLDR
A speaker-adaptive HMM-based speech synthesis system is described that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems.
Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora
TLDR
This paper demonstrates thousands of voices for HMM-based speech synthesis built from several popular ASR corpora, such as the Wall Street Journal, Resource Management, Globalphone, and SPEECON databases.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
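As background drawn from the WaveNet paper itself rather than from this summary: the model covers long raw-audio contexts cheaply by stacking causal convolutions whose dilation doubles per layer, so the receptive field grows exponentially with depth. A minimal numpy sketch with two-tap filters (the function name and weights are illustrative):

```python
import numpy as np

def dilated_causal_conv(x, w_now, w_past, dilation):
    # Two-tap causal convolution: output[t] mixes x[t] and x[t - dilation],
    # never any future sample (zero left-padding keeps the length fixed).
    padded = np.concatenate([np.zeros(dilation), x])
    return w_now * padded[dilation:] + w_past * padded[:-dilation]

rng = np.random.default_rng(0)
h = rng.standard_normal(1024)              # raw-audio stand-in
for dilation in [1, 2, 4, 8, 16, 32]:      # dilation doubles per layer
    h = np.tanh(dilated_causal_conv(h, 0.9, 0.5, dilation))
# Receptive field after the stack: 1 + (1+2+4+8+16+32) = 64 samples,
# from only six layers.
```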
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.