Corpus ID: 26100519

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

@article{Ping2017DeepV3,
  title={Deep Voice 3: 2000-Speaker Neural Text-to-Speech},
  author={Wei Ping and Kainan Peng and Andrew Gibiansky and Sercan {\"O}. Arik and Ajay Kannan and Sharan Narang and Jonathan Raiman and John Miller},
  journal={ArXiv},
  year={2017},
  volume={abs/1710.07654}
}
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare… 
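The two architectural ingredients the abstract names, fully-convolutional sequence processing and attention, can be pictured with a short sketch. The PyTorch code below is an illustration only, not the authors' implementation; the module names, kernel sizes, and dimensions are all invented for clarity.

```python
# Minimal sketch of Deep Voice 3-style ingredients: gated convolution blocks
# with residual connections, sinusoidal position codes, and dot-product
# attention. All hyperparameters here are illustrative, not the paper's values.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """1-D convolution with a gated linear unit and a scaled residual."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                     # x: (batch, channels, time)
        a, b = self.conv(x).chunk(2, dim=1)   # value and gate halves
        return (a * torch.sigmoid(b) + x) * math.sqrt(0.5)

def positional_encoding(length: int, dim: int) -> torch.Tensor:
    """Sinusoidal position codes added to keys and queries (dim must be even)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, idx / dim)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def attend(queries, keys, values):
    """Scaled dot-product attention of decoder queries over encoder keys."""
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(queries.size(-1))
    weights = F.softmax(scores, dim=-1)
    return weights @ values, weights
```

The attention error modes the abstract refers to, such as skipped or repeated words, show up as non-monotonic attention weight matrices; the paper's mitigation constrains attention to advance monotonically at inference, a step this sketch omits.

Citations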
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
TLDR
The proposed approach aims to overcome existing limitations by obtaining a system that can model a multi-speaker acoustic space and generate speech audio similar to the voices of different target speakers, even ones not observed during the training phase.
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
TLDR
This work introduces meta-learning to adapt the neural TTS front-end and shows that, for low-quality public recordings, adaptation based on the multi-speaker corpus can generate a cleaner target voice than the speaker-dependent model.
Deep Text-to-Speech System with Seq2Seq Model
TLDR
It is shown that the proposed model achieves attention alignment much faster than previous architectures and that good audio quality can be achieved with a much smaller model.
Lightspeech: Lightweight Non-Autoregressive Multi-Speaker Text-To-Speech
TLDR
LightSpeech, a new lightweight non-autoregressive multi-speaker speech synthesis system, uses lightweight feedforward neural networks to accelerate synthesis and reduce the number of parameters.
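Since the snippet only names the non-autoregressive mechanism, here is a hedged sketch of the length-regulator idea such systems rely on: a predicted duration per input token expands the encoder features so all spectrogram frames can be produced in parallel. LightSpeech's actual architecture differs; everything below is illustrative.

```python
import torch

def length_regulate(encoder_out: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    """Expand (tokens, dim) features to (frames, dim) using integer durations."""
    return torch.repeat_interleave(encoder_out, durations, dim=0)

tokens = torch.randn(4, 8)               # 4 phonemes, 8-dim encoder features
durs = torch.tensor([3, 1, 5, 2])        # predicted frames per phoneme
frames = length_regulate(tokens, durs)   # shape (11, 8); decoded in parallel
```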
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
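A hedged sketch of the transfer-learning recipe this entry describes, assuming a frozen verification-style speaker encoder whose fixed embedding conditions the synthesizer; the class, call signature, and dimensions are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedEncoder(nn.Module):
    """Broadcasts a fixed speaker embedding onto every text-encoder frame."""
    def __init__(self, text_dim: int, spk_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(text_dim + spk_dim, out_dim)

    def forward(self, text_feats, spk_embed):
        # text_feats: (batch, time, text_dim); spk_embed: (batch, spk_dim)
        spk = spk_embed.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        return self.proj(torch.cat([text_feats, spk], dim=-1))

# The entry's finding: even a random point on the embedding hypersphere can
# act as a plausible "novel speaker" when fed through the same pathway.
novel_speaker = F.normalize(torch.randn(1, 256), dim=-1)
```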
Shared model for multi-source speech generation tasks
Many speech technologies contain a speech generation stage, such as text-to-speech (TTS), voice conversion (VC), and speech enhancement (SE). Recent advances in deep-learning-based methods significantly…
Multi-Speaker End-to-End Speech Synthesis
TLDR
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems
TLDR
This work extends state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora themselves, closing the gap to a comparable oracle experiment by more than 50%.
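A hedged sketch of the augmentation idea, with `tts_synthesize` as a hypothetical stand-in for a TTS model trained on the same corpus; the paper's actual pipeline is more involved.

```python
# Illustrative only: synthesize audio for transcripts drawn from the ASR
# corpus itself, then train on the union of real and synthetic pairs.
def augment_with_tts(real_pairs, transcripts, tts_synthesize):
    """real_pairs: list of (audio, text); returns an enlarged training set."""
    synthetic_pairs = [(tts_synthesize(text), text) for text in transcripts]
    return real_pairs + synthetic_pairs
```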
Sample Efficient Adaptive Text-to-Speech
TLDR
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
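One of the adaptation strategies such work benchmarks, fitting only a new speaker's embedding while the shared network stays frozen, can be sketched as follows; `model`, its call signature, and the embedding size are placeholders.

```python
import torch

def adapt_embedding(model, batches, loss_fn, steps=500, lr=1e-2):
    """Fit only a new speaker's embedding; all shared weights stay frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    embed = torch.zeros(1, 128, requires_grad=True)    # new speaker vector
    opt = torch.optim.Adam([embed], lr=lr)
    for _ in range(steps):
        for text, audio in batches:
            opt.zero_grad()
            loss = loss_fn(model(text, embed), audio)  # model signature assumed
            loss.backward()
            opt.step()
    return embed.detach()
```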
AdaSpeech: Adaptive Text to Speech for Custom Voice
TLDR
AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices, is proposed; it achieves much better adaptation quality than baseline methods with only about 5K specific parameters per speaker, demonstrating its effectiveness for custom voice.
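The roughly 5K-parameters-per-speaker figure is achievable with conditional layer normalization, the mechanism AdaSpeech uses: a small projection of the speaker embedding produces each LayerNorm's scale and bias, so only that projection needs storing per voice. The sketch below shows the idea with invented dimensions.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias come from a speaker embedding."""
    def __init__(self, hidden: int, spk_dim: int):
        super().__init__()
        self.scale = nn.Linear(spk_dim, hidden)   # speaker-specific gamma
        self.bias = nn.Linear(spk_dim, hidden)    # speaker-specific beta
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)

    def forward(self, x, spk):                    # x: (batch, time, hidden)
        g = self.scale(spk).unsqueeze(1)          # (batch, 1, hidden)
        b = self.bias(spk).unsqueeze(1)
        return self.norm(x) * g + b
```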

References

SHOWING 1-10 OF 30 REFERENCES
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
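Speedups of this kind rest on avoiding recomputation in dilated convolutions. Below is a hedged NumPy sketch of the widely used per-layer activation cache, which is consistent with the caching the paper describes but not its hand-optimized kernels; shapes and weights are invented.

```python
import numpy as np

class CachedDilatedLayer:
    """Kernel-size-2 dilated conv that caches past inputs in a ring buffer."""
    def __init__(self, dilation: int, channels: int, rng):
        self.queue = np.zeros((dilation, channels))  # holds the last `dilation` inputs
        self.head = 0
        self.w_old = rng.standard_normal((channels, channels)) * 0.1
        self.w_new = rng.standard_normal((channels, channels)) * 0.1

    def step(self, x: np.ndarray) -> np.ndarray:
        old = self.queue[self.head]                  # input from `dilation` steps ago
        self.queue[self.head] = x
        self.head = (self.head + 1) % len(self.queue)
        return np.tanh(old @ self.w_old + x @ self.w_new)

rng = np.random.default_rng(0)
layer = CachedDilatedLayer(dilation=4, channels=8, rng=rng)
y = layer.step(np.zeros(8))   # one new sample in, one activation out, O(1) work
```

Each new audio sample then costs one small matrix multiply per layer instead of re-running the convolution over the whole receptive field.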
Voice Synthesis for in-the-Wild Speakers via a Phonological Loop
TLDR
A new neural text-to-speech method is presented that can transform text to speech in voices sampled in the wild and can deal with unconstrained samples obtained from public speeches.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
TLDR
A speaker-adaptive HMM-based speech synthesis system that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems are described.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
TLDR
A new neural text-to-speech method is presented that can transform text to speech in voices sampled in the wild, without requiring aligned phonemes or linguistic features, making TTS accessible to a wider range of applications.
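The "phonological loop" can be pictured as a short rolling memory buffer updated once per output frame, with the newest context vector pushed in front and the oldest dropped; all predictions read from this buffer. The sketch below is an assumption-labeled illustration with invented sizes, not the paper's parameterization.

```python
import numpy as np

def loop_step(buffer: np.ndarray, new_vec: np.ndarray) -> np.ndarray:
    """buffer: (slots, dim); drop the oldest slot, push new_vec in front."""
    return np.vstack([new_vec[None, :], buffer[:-1]])

buf = np.zeros((20, 64))                    # 20-slot memory, 64-dim vectors
buf = loop_step(buf, np.random.randn(64))   # one update per output frame
```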
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text; its reader is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora
TLDR
This paper demonstrates thousands of voices for HMM-based speech synthesis built from several popular ASR corpora, such as the Wall Street Journal, Resource Management, GlobalPhone, and SPEECON databases.
Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System
TLDR
Apple's hybrid unit selection speech synthesis system, which provides the voices for Siri under requirements of naturalness, personality, and expressivity, is described, along with techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint.
Text-to-Speech Synthesis
TLDR
Text-to-Speech Synthesis provides an in-depth explanation of all aspects of current speech synthesis technology, and is designed for graduate students in electrical engineering, computer science, and linguistics.