Tacotron: Towards End-to-End Speech Synthesis

@inproceedings{Wang2017TacotronTE,
  title={Tacotron: Towards End-to-End Speech Synthesis},
  author={Yuxuan Wang and R. J. Skerry-Ryan and Daisy Stanton and Yonghui Wu and Ron J. Weiss and Navdeep Jaitly and Zongheng Yang and Ying Xiao and Z. Chen and Samy Bengio and Quoc V. Le and Yannis Agiomyrgiannakis and Robert A. J. Clark and Rif A. Saurous},
  booktitle={INTERSPEECH},
  year={2017}
}
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. [...] Key Method: We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's …
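The abstract's core claim is architectural: a single sequence-to-sequence network with attention maps characters straight to spectrogram frames, several (r) at a time. Below is a minimal PyTorch sketch of that shape; the class name, layer sizes, and single-head attention are assumptions of this sketch, not the paper's exact architecture (the real model uses a CBHG encoder and other details omitted here).

```python
import torch
import torch.nn as nn

class CharToMelSeq2Seq(nn.Module):
    def __init__(self, n_chars=256, emb=256, enc=256, dec=512, n_mels=80, r=3):
        super().__init__()
        self.n_mels, self.r = n_mels, r          # Tacotron emits r frames per step
        self.embed = nn.Embedding(n_chars, emb)
        self.encoder = nn.GRU(emb, enc, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * enc, dec)  # match encoder memory to decoder width
        self.prenet = nn.Sequential(nn.Linear(n_mels, dec), nn.ReLU())
        self.attn = nn.MultiheadAttention(dec, num_heads=1, batch_first=True)
        self.decoder = nn.GRUCell(2 * dec, dec)
        self.frame_out = nn.Linear(dec, n_mels * r)

    def forward(self, chars, n_steps):
        memory = self.enc_proj(self.encoder(self.embed(chars))[0])  # (B, T, dec)
        B = chars.size(0)
        h = memory.new_zeros(B, memory.size(-1))
        frame = memory.new_zeros(B, self.n_mels)                    # <GO> frame
        out = []
        for _ in range(n_steps):
            query = self.prenet(frame).unsqueeze(1)                 # (B, 1, dec)
            context, _ = self.attn(query, memory, memory)           # soft alignment
            h = self.decoder(torch.cat([query, context], -1).squeeze(1), h)
            frames = self.frame_out(h).view(B, self.r, self.n_mels)
            out.append(frames)
            frame = frames[:, -1]                # feed back the last predicted frame
        return torch.cat(out, dim=1)             # (B, n_steps * r, n_mels)
```

In training one would teacher-force the loop (feed the ground-truth previous frame rather than the prediction) and minimize an L1 loss against target mel frames.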
Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System
TLDR
The subjective and objective evaluation results indicated that the proposed adaptation system coupled with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
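The adaptation recipe described here is, at heart, ordinary fine-tuning. A hedged sketch under stated assumptions: it reuses the illustrative CharToMelSeq2Seq class from the sketch above, a hypothetical checkpoint path, and an assumed lombard_loader over a small Lombard corpus; freezing the text encoder is one common choice, not necessarily this paper's.

```python
import torch

model = CharToMelSeq2Seq()
model.load_state_dict(torch.load("pretrained_normal_style.pt"))  # hypothetical checkpoint

# Freeze the text encoder so the small adaptation set mostly reshapes the
# decoder, which carries the acoustic (speaking-style) information.
for p in model.encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)

for chars, mels in lombard_loader:            # small Lombard-speech dataset (assumed)
    pred = model(chars, n_steps=mels.size(1) // model.r)
    loss = torch.nn.functional.l1_loss(pred, mels)
    opt.zero_grad(); loss.backward(); opt.step()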
Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem
TLDR
Es-Tacotron2 augments Tacotron 2 with an estimated network that captures general features from the raw mel spectrogram in an unsupervised manner, enabling it to produce more variable decoder output and synthesize more natural and expressive speech.
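The multi-task idea can be stated as a two-term objective: the usual spectrogram reconstruction loss plus an auxiliary term that makes the decoder match features from the pre-trained estimating network, discouraging over-smoothed average outputs. A minimal sketch; the weight alpha and both loss choices are assumptions of this sketch.

```python
import torch.nn.functional as F

def es_tacotron_loss(pred_mel, target_mel, pred_est, target_est, alpha=0.5):
    """Reconstruction loss plus an auxiliary estimated-feature term (alpha assumed)."""
    recon = F.l1_loss(pred_mel, target_mel)   # standard spectrogram loss
    aux = F.mse_loss(pred_est, target_est)    # match the pre-trained estimating network
    return recon + alpha * aux
```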
Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language
TLDR
The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they represent important stepping stones towards end-to-end Japanese speech synthesis.
Conditional End-to-End Audio Transforms
TLDR
An end-to-end method for transforming audio from one style to another, based on convolutional and hierarchical recurrent neural networks; it is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.
Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
TLDR
This work aims to lower TTS systems' reliance on high-quality data by providing them, during training, with the textual knowledge extracted by deep pre-trained language models; it investigates the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS model consisting of an encoder and an attention-based decoder.
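The mechanism is feature extraction from a pre-trained language model. As a sketch of the conditioning step only (not the paper's full fusion scheme), one can pull contextual token embeddings from BERT via Hugging Face transformers and hand them to the synthesizer as extra memory for the decoder's attention.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state     # (1, n_tokens, 768)
# `hidden` would be handed to the synthesizer as additional conditioning,
# e.g. attended over by the decoder alongside the character encoder outputs.
```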
Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention
TLDR
This study proposes a transfer learning method to adapt a sequence-to-sequence based TTS system of normal speaking style to Lombard style and results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based T TS system in synthesis of Lombard speech. Expand
Multi-Speaker End-to-End Speech Synthesis
TLDR
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
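A common pattern behind multi-speaker end-to-end systems like this one is a learned per-speaker embedding injected into the synthesizer, so one jointly optimized model covers many voices. A minimal sketch, with the table size, embedding width, and concatenation site all assumptions; where the embedding enters (encoder, decoder, vocoder) varies by system.

```python
import torch
import torch.nn as nn

speakers = nn.Embedding(num_embeddings=100, embedding_dim=64)  # 100 voices (assumed)

def add_speaker(states, speaker_ids):
    """Concatenate a per-speaker vector onto every decoder timestep."""
    s = speakers(speaker_ids).unsqueeze(1)                     # (B, 1, 64)
    return torch.cat([states, s.expand(-1, states.size(1), -1)], dim=-1)
```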
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
TLDR
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and to allow it to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora.
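One way to exploit untranscribed speech, sketched below under assumptions (this illustrates the general idea, not the paper's exact procedure), is to pre-train the spectrogram decoder as an unconditional autoregressive frame predictor, then use those weights to initialize the TTS decoder before fine-tuning on paired data.

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Toy autoregressive mel-frame model used only for decoder pre-training."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, mels):                    # (B, T, n_mels), no text needed
        prev = torch.roll(mels, shifts=1, dims=1)
        prev[:, 0] = 0.0                        # <GO> frame
        h, _ = self.rnn(prev)
        return self.out(h)                      # predict each frame from its past
```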
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
TLDR
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis. Expand
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
TLDR
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody; this results in synthesized audio that matches the prosody of the reference signal with fine time detail.
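Stripped to its skeleton, the prosody-transfer extension is a reference encoder: it compresses a reference mel spectrogram into a fixed-size prosody vector that is broadcast across the text-encoder states. The sketch below assumes a single-GRU reference encoder and additive combination; the paper's reference encoder is more elaborate.

```python
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, out_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, ref_mels, text_states):   # (B, T_ref, n_mels), (B, T_text, out_dim)
        _, h = self.rnn(ref_mels)
        prosody = torch.tanh(self.proj(h[-1]))  # fixed-size prosody embedding
        return text_states + prosody.unsqueeze(1)  # broadcast over text positions
```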

References

Showing 1–10 of 27 references
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention
TLDR
This paper attempts to bypass the limitations of the conventional pipeline using a novel end-to-end parametric TTS synthesis framework in which text analysis and acoustic modeling are integrated through an attention-based recurrent neural network.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis; the paper shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text: a bidirectional recurrent neural network with attention produces vocoder acoustic features, which a neural vocoder then turns into audio.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognition systems.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
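WaveNet's enabling trick is its stack of dilated causal convolutions, which doubles the receptive field per layer so a sample can depend on thousands of past samples at modest depth. A minimal sketch of just that stack; the gated activations, residual/skip connections, and categorical (mu-law) output of the real model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, n_layers=8, kernel=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel, dilation=2 ** i)
            for i in range(n_layers))
        self.pads = [(kernel - 1) * 2 ** i for i in range(n_layers)]

    def forward(self, x):                             # (B, channels, T)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.tanh(conv(F.pad(x, (pad, 0))))  # left-pad keeps it causal
        return x                                      # receptive field grows as 2**n_layers
```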
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which profits from combining memory-less modules (autoregressive multilayer perceptrons) with stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature.
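The hierarchy can be made concrete with two tiers: a slow RNN that steps once per frame of samples and conditions a fast per-sample predictor. The sketch below is a deliberately simplified two-tier variant (real SampleRNN uses more tiers and different upsampling); frame size, hidden size, and quantization levels are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTierSampleRNN(nn.Module):
    def __init__(self, frame_size=16, hidden=256, levels=256):
        super().__init__()
        self.frame_size = frame_size
        self.frame_rnn = nn.GRU(frame_size, hidden, batch_first=True)  # slow tier
        self.sample_mlp = nn.Sequential(                               # fast tier
            nn.Linear(hidden + frame_size, hidden), nn.ReLU(),
            nn.Linear(hidden, levels))       # categorical over quantized samples

    def forward(self, x):                    # (B, T) samples, T % frame_size == 0
        B, T = x.shape
        frames = x.view(B, T // self.frame_size, self.frame_size)
        prev_frames = F.pad(frames, (0, 0, 1, 0))[:, :-1]     # condition on past frames only
        ctx, _ = self.frame_rnn(prev_frames)                  # (B, n_frames, hidden)
        ctx = ctx.repeat_interleave(self.frame_size, dim=1)   # (B, T, hidden)
        prev = F.pad(x, (self.frame_size, 0))[:, :-1]         # causal sample history
        windows = prev.unfold(1, self.frame_size, 1)          # (B, T, frame_size)
        return self.sample_mlp(torch.cat([ctx, windows], -1)) # (B, T, levels) logits
```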
Vocaine the vocoder and applications in speech synthesis
  (Yannis Agiomyrgiannakis, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
TLDR
A new vocoder synthesizer, referred to as Vocaine, is presented; it features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines, and a super-fast cosine generator.
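The quadratic-phase idea can be illustrated generically: if frequency varies linearly across a frame, its integral (the phase) is quadratic, so frames can be chained with continuous phase and amplitude. The numpy sketch below is a generic AM-FM construction for intuition, not Vocaine's implementation; all numbers are arbitrary.

```python
import numpy as np

def synth_segment(f0, f1, a0, a1, phase0, n, sr=16000):
    """One frame of an AM-FM sinusoid with quadratic phase (linear frequency)."""
    t = np.arange(n) / sr
    dur = n / sr
    slope = (f1 - f0) / dur
    phase = phase0 + 2 * np.pi * (f0 * t + 0.5 * slope * t ** 2)
    end_phase = phase0 + 2 * np.pi * (f0 * dur + 0.5 * slope * dur ** 2)
    return np.linspace(a0, a1, n) * np.cos(phase), end_phase

x1, ph = synth_segment(200.0, 260.0, 0.8, 1.0, 0.0, 320)  # 20 ms at 16 kHz
x2, _ = synth_segment(260.0, 240.0, 1.0, 0.9, ph, 320)    # phase stays continuous
```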
RNN Approaches to Text Normalization: A Challenge
TLDR
A data set of general text, where the normalizations were generated using an existing text normalization component of a text-to-speech system, is presented; it is shown that a simple FST-based filter can mitigate errors and achieve a level of accuracy not achievable by the RNN alone.
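The division of labor is: a neural model proposes verbalizations, and a constrained grammar vetoes implausible ones. The toy sketch below substitutes a trivial regex check for the FST and mocks away the RNN entirely; it is meant only to show the accept-or-fall-back shape of the filter.

```python
import re

def fst_like_filter(token: str, candidate: str) -> str:
    """Accept the model's proposal only if it is a plausible verbalization."""
    if re.fullmatch(r"\d+", token):
        # A number must verbalize to alphabetic words, never to other digits.
        if re.fullmatch(r"[a-z ]+", candidate):
            return candidate
        return token           # safe fallback: read the token as written
    return candidate

print(fst_like_filter("123", "one hundred twenty three"))  # accepted
print(fst_like_filter("123", "456"))                       # rejected -> "123"
```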
Text-to-speech synthesis
TLDR
An overview of the problems that occur during text-to-speech (TTS) conversion is presented, and the particular solutions to these problems taken within the AT&T Bell Laboratories TTS system are described.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
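The reversal trick itself is a one-line data transformation, sketched here with a made-up sentence pair: only the encoder input is reversed, the target is untouched, so the first source word sits next to the first target word the decoder must emit.

```python
# Teacher-forced training pair with the source reversed (illustrative example).
src = ["je", "suis", "etudiant"]          # source sentence
tgt = ["i", "am", "a", "student"]         # target sentence, left as-is
encoder_input = list(reversed(src))       # ["etudiant", "suis", "je"]
```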