Tacotron: Towards End-to-End Speech Synthesis

@inproceedings{Wang2017TacotronTE,
  title={Tacotron: Towards End-to-End Speech Synthesis},
  author={Yuxuan Wang and R. J. Skerry-Ryan and Daisy Stanton and Yonghui Wu and Ron J. Weiss and Navdeep Jaitly and Zongheng Yang and Ying Xiao and Z. Chen and Samy Bengio and Quoc V. Le and Yannis Agiomyrgiannakis and Robert A. J. Clark and Rif A. Saurous},
  booktitle={Interspeech},
  year={2017}
}
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
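The frame-level generation the abstract refers to can be illustrated with a minimal sketch (not the authors' implementation): an attention-based decoder aligns each decoder step against the encoder outputs and predicts r spectrogram frames at once (Tacotron's "reduction factor"). All dimensions, the dot-product attention, and the toy recurrence below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T_enc, d_enc, d_mel, r = 20, 8, 4, 3     # encoder steps, dims, reduction factor (toy values)
encoder_out = rng.standard_normal((T_enc, d_enc))  # hypothetical encoder memory

def attend(query, memory):
    # Dot-product attention: align one decoder query against all encoder steps.
    scores = memory @ query                  # (T_enc,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over encoder steps
    return weights @ memory                  # context vector, (d_enc,)

# Hypothetical linear projection from context to r output frames per step.
W_out = rng.standard_normal((d_enc, d_mel * r)) * 0.1

frames = []
query = np.zeros(d_enc)
for step in range(5):                        # 5 decoder steps -> 15 frames
    context = attend(query, encoder_out)
    out = context @ W_out                    # predict r frames in one step
    frames.extend(out.reshape(r, d_mel))
    query = context                          # feed context back (toy recurrence)

spectrogram = np.stack(frames)
print(spectrogram.shape)                     # (15, 4): frame-level output
```

Predicting multiple frames per decoder step is what makes frame-level decoding faster than sample-level autoregression: far fewer sequential steps are needed per second of audio.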


Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem

Es-Tacotron2 is proposed, featuring an estimated network that captures general features from a raw mel spectrogram in an unsupervised manner; it produces more variable decoder output and synthesizes more natural and expressive speech.

Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis

A sequence-to-sequence neural network which directly generates speech waveforms from text inputs, extending the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop, enabling parallel training and synthesis.

Differentiable Duration Modeling for End-to-End Text-to-Speech

This model learns to perform high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration and obtains competitive results while enjoying a much simpler training pipeline.

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.

Conditional End-to-End Audio Transforms

An end-to-end method for transforming audio from one style to another, based on convolutional and hierarchical recurrent neural networks, is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

This work aims to lower TTS systems' reliance on high-quality data by supplying, during training, textual knowledge extracted by deep pre-trained language models; it investigates the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS system consisting of an encoder and an attention-based decoder.

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

This study proposes a transfer learning method to adapt a sequence-to-sequence TTS system from a normal speaking style to Lombard style; results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in synthesizing Lombard speech.

Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

Qualitative and quantitative evaluations demonstrate the superiority and robustness of the method for lossless speech generation while also showing a strong capability in prosody modeling.

Multi-Speaker End-to-End Speech Synthesis

It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.

End-to-End Adversarial Text-to-Speech

This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.
...

References

SHOWING 1-10 OF 26 REFERENCES

First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention

This paper attempts to bypass these limitations with a novel end-to-end parametric TTS synthesis framework in which text analysis and acoustic modeling are integrated using an attention-based recurrent neural network.

Deep Voice: Real-time Neural Text-to-Speech

Deep Voice lays the groundwork for truly end-to-end neural speech synthesis; inference with the system can be performed faster than real time, using optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.

Char2Wav: End-to-End Speech Synthesis

Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognition systems.

Vocaine the vocoder and applications in speech synthesis

  • Yannis Agiomyrgiannakis
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
A new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator is presented.

RNN Approaches to Text Normalization: A Challenge

A data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system is presented, and it is shown that a simple FST-based filter can mitigate errors, and achieve a level of accuracy not achievable by the RNN alone.

Text-to-speech synthesis

An overview of the problems that occur during text-to-speech (TTS) conversion is presented and the particular solutions to these problems taken within the AT&T Bell Laboratories TTS system are described.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.

Fully Character-Level Neural Machine Translation without Explicit Segmentation

A neural machine translation model that maps a source character sequence to a target character sequence without any segmentation is introduced, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities.

Statistical Parametric Speech Synthesis

  • H. Zen, K. Tokuda, A. Black
  • 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007