• Corpus ID: 5580515

Deep Voice: Real-time Neural Text-to-Speech

@article{Arik2017DeepVR,
  title={Deep Voice: Real-time Neural Text-to-Speech},
  author={Sercan {\"O}. Arik and Mike Chrzanowski and Adam Coates and Gregory Frederick Diamos and Andrew Gibiansky and Yongguo Kang and Xian Li and John Miller and Andrew Ng and Jonathan Raiman and Shubho Sengupta and Mohammad Shoeybi},
  journal={ArXiv},
  year={2017},
  volume={abs/1702.07825}
}
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original.
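The CTC loss mentioned in the abstract scores a label sequence by summing over all frame-level alignments that collapse to it (repeats merged, blanks removed). The paper's segmentation model pairs CTC with a neural network; the standalone sketch below only illustrates the CTC forward algorithm itself on toy, hand-written log-probabilities — the `ctc_loss` helper and its inputs are illustrative, not taken from the paper.

```python
import math

NEG_INF = float("-inf")

def _logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_loss(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: T x V nested lists of per-frame log-probabilities.
    target: label sequence without blanks.
    """
    # Interleave blanks: target [a, b] becomes [^, a, ^, b, ^].
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S = len(ext)

    # alpha[s]: log-prob mass of alignments ending at ext[s] so far.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                         # stay on the same symbol
            if s > 0:
                a = _logsumexp(a, alpha[s - 1])  # advance by one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = _logsumexp(a, alpha[s - 2])  # skip over a blank
            new[s] = a + frame[ext[s]]
        alpha = new
    total = _logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

# Two frames, uniform probability over {blank=0, label=1}: the three
# alignments "11", "^1", "1^" all collapse to [1], so P = 3 * 0.25.
lp = [[math.log(0.5), math.log(0.5)]] * 2
loss = ctc_loss(lp, [1])  # -log(0.75) ≈ 0.2877
```

Because the sum runs over every valid alignment, no pre-segmented phoneme boundaries are needed at training time, which is what makes CTC attractive for boundary detection.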

Citations

Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS is proposed, which speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.
Reformer-TTS: Neural Speech Synthesis with Reformer Network
TLDR
This work proposes Reformer-TTS, a model using a Reformer network, which utilizes locality-sensitive hashing attention and the reversible residual network, leading to fast convergence when training an end-to-end TTS system.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Differentiable Duration Modeling for End-to-End Text-to-Speech
TLDR
This paper proposes a differentiable duration method for learning monotonic alignments between input and output sequences based on a soft-duration mechanism that optimizes a stochastic process in expectation.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
TLDR
A new neural text-to-speech method that is able to transform text to speech in voices that are sampled in the wild, without requiring aligned phonemes or linguistic features, is presented, making TTS accessible to a wider range of applications.
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
TLDR
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
Voice Synthesis for in-the-Wild Speakers via a Phonological Loop
TLDR
A new neural text-to-speech method that is able to transform text to speech in voices that are sampled in the wild, and that is able to deal with unconstrained samples obtained from public speeches, is presented.
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
TLDR
This work introduces meta-learning to adapt the neural TTS front-end and shows that for low quality public recordings, the adaptation based on the multi-speaker corpus can generate a cleaner target voice in comparison with the speaker-dependent model.

References

SHOWING 1-10 OF 31 REFERENCES
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
A Template-Based Approach for Speech Synthesis Intonation Generation Using LSTMs
TLDR
A template-based approach for automatic F0 generation is presented, in which per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN); the approach is able to reproduce pitch patterns observed in the data.
Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks
TLDR
This work proposes a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN) that has the flexibility of taking the full context of graphemes into consideration and transforms the problem from a series of grapheme-to-phoneme conversions into a single word-to-pronunciation conversion.
Statistical parametric speech synthesis using deep neural networks
TLDR
This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed HMM-based systems with similar numbers of parameters.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis
  • H. Zen, H. Sak
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
TLDR
Experimental results in subjective listening tests show that the proposed architecture can synthesize natural-sounding speech without requiring utterance-level batch processing.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Sequence-to-sequence neural net models for grapheme-to-phoneme conversion
TLDR
The simple side-conditioned generation approach is able to rival state-of-the-art bi-directional long short-term memory (LSTM) neural networks that use the same alignment information used in conventional approaches.
WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications
TLDR
A vocoder-based speech synthesis system named WORLD was developed in an effort to improve the sound quality of real-time applications using speech, and was shown to be superior to the other systems in terms of both sound quality and processing speed.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which profits from combining memory-less modules (autoregressive multilayer perceptrons) and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature.