Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Jonathan Shen, Ruoming Pang, Yonghui Wu
- Computer Science · IEEE International Conference on Acoustics…
- 16 December 2017
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
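The mel-scale spectrogram features that the Tacotron 2 feature prediction network targets can be sketched with plain NumPy; the parameter values below (n_fft=1024, hop=256, n_mels=80) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            if center > left:
                fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, apply a Hann window, take the power FFT,
    # then project each frame onto the mel filterbank.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # (frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (frames, n_mels)
    return np.log(np.clip(mel, 1e-5, None))             # log compression

sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # one second of A4
S = log_mel_spectrogram(y, sr=sr)
print(S.shape)  # (n_frames, n_mels)
```

In the paper, a sequence of such frames is the prediction target of the recurrent network, and a WaveNet vocoder is conditioned on the predicted frames to generate waveform samples.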
Tacotron: Towards End-to-End Speech Synthesis
Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, is presented; it achieves a 3.82 mean opinion score on a subjective 5-point scale on US English, outperforming a production parametric system in terms of naturalness.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs) are introduced: a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The tokens learn to factorize noise and speaker identity, providing a path towards highly scalable yet robust speech synthesis.
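The combination step described above — a reference embedding attending over a bank of style tokens — can be sketched as single-head attention in NumPy. All shapes and the random values standing in for learned parameters are illustrative assumptions; the paper uses multi-head attention and trains the tokens jointly with the synthesizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, token_dim, ref_dim = 10, 256, 128

# Bank of global style tokens; random values stand in for learned embeddings.
tokens = rng.standard_normal((n_tokens, token_dim))

# Reference-encoder output summarizing an utterance's style (assumed shape).
ref_embedding = rng.standard_normal(ref_dim)

# Simplified single-head attention: project the reference embedding to a
# query, score it against each token, and softmax the scores.
W_q = rng.standard_normal((ref_dim, token_dim)) / np.sqrt(ref_dim)
query = ref_embedding @ W_q                    # (token_dim,)
scores = tokens @ query / np.sqrt(token_dim)   # (n_tokens,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # softmax combination weights

# Weighted sum of tokens: the style embedding that conditions the decoder.
style_embedding = weights @ tokens             # (token_dim,)
print(weights.shape, style_embedding.shape)
```

At inference time the weights can instead be set by hand, which is what gives GSTs their style-control interpretation.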
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
An extension to the Tacotron speech synthesis architecture is presented that learns a latent embedding space of prosody derived from a reference acoustic representation; conditioning on this embedding yields synthesized audio that matches the prosody of the reference signal in fine time detail.
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and describes several key techniques that make the sequence-to-sequence framework perform well on this challenging task.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron is described that produces high-quality speech in multiple languages and transfers voices across languages, e.g. English and Mandarin.
Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis
- Daisy Stanton, Yuxuan Wang, R. Skerry-Ryan
- Computer Science · IEEE Spoken Language Technology Workshop (SLT)
- 4 August 2018
This work introduces the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron, and shows that the system can render text with more pitch and energy variation than two state-of-the-art baseline models.
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
- Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, R. Skerry-Ryan
- Computer Science · ICASSP - IEEE International Conference on…
- 30 August 2018
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and allow it to utilize textual and acoustic knowledge contained in large, publicly-available text and speech corpora.
Uncovering Latent Style Factors for Expressive Speech Synthesis
This preliminary study introduces the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model, and shows that without annotation data or an explicit supervision signal, this approach can automatically learn a variety of prosodic variations in a purely data-driven way.
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
A novel generative model is proposed that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models, and is able to reliably discover and control important but rarely labelled attributes of speech.