Tacotron: Towards End-to-End Speech Synthesis
@inproceedings{Wang2017TacotronTE,
  title={Tacotron: Towards End-to-End Speech Synthesis},
  author={Yuxuan Wang and R. J. Skerry-Ryan and Daisy Stanton and Yonghui Wu and Ron J. Weiss and Navdeep Jaitly and Zongheng Yang and Ying Xiao and Z. Chen and Samy Bengio and Quoc V. Le and Yannis Agiomyrgiannakis and Robert A. J. Clark and Rif A. Saurous},
  booktitle={Interspeech},
  year={2017}
}
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. […] We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's…
1,282 Citations
Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem
- Computer Science
- Inf.
- 2019
Es-Tacotron2 is proposed, with an estimated network that captures general features from a raw mel spectrogram in an unsupervised manner; it can produce more variable decoder output and synthesize more natural and expressive speech.
Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
- Computer Science
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A sequence-to-sequence neural network which directly generates speech waveforms from text inputs, extending the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop, enabling parallel training and synthesis.
Differentiable Duration Modeling for End-to-End Text-to-Speech
- Computer Science
- ArXiv
- 2022
This model learns to perform high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration and obtains competitive results while enjoying a much simpler training pipeline.
Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.
Conditional End-to-End Audio Transforms
- Computer Science
- INTERSPEECH
- 2018
An end-to-end method for transforming audio from one style to another based on convolutional and hierarchical recurrent neural networks, designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.
Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
- Computer Science
- ArXiv
- 2019
This work aims to lower TTS systems' reliance on high-quality data by providing them with the textual knowledge extracted by deep pre-trained language models during training, and investigates the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS system consisting of an encoder and an attention-based decoder.
Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention
- Computer Science
- ArXiv
- 2018
This study proposes a transfer learning method to adapt a sequence-to-sequence based TTS system of normal speaking style to Lombard style, and results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in synthesis of Lombard speech.
Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech
- Computer Science
- ArXiv
- 2022
Qualitative and quantitative evaluations demonstrate the superiority and robustness of the method for lossless speech generation while also showing a strong capability in prosody modeling.
Multi-Speaker End-to-End Speech Synthesis
- Computer Science
- ArXiv
- 2019
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
End-to-End Adversarial Text-to-Speech
- Computer Science
- ICLR
- 2021
This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.
References
SHOWING 1-10 OF 26 REFERENCES
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention
- Computer Science
- INTERSPEECH
- 2016
This paper attempts to bypass limitations using a novel end-to-end parametric TTS synthesis framework, i.e. the text analysis and acoustic modeling are integrated together employing an attention-based recurrent neural network.
Deep Voice: Real-time Neural Text-to-Speech
- Computer Science
- ICML
- 2017
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Char2Wav: End-to-End Speech Synthesis
- Computer Science
- ICLR
- 2017
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
- Computer Science
- 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional…
Vocaine the vocoder and applications in speech synthesis
- Computer Science
- 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
A new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator is presented.
RNN Approaches to Text Normalization: A Challenge
- Computer Science
- ArXiv
- 2016
A data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system is presented, and it is shown that a simple FST-based filter can mitigate errors, and achieve a level of accuracy not achievable by the RNN alone.
Text-to-speech synthesis
- Physics
- AT&T Technical Journal
- 1995
An overview of the problems that occur during text-to-speech (TTS) conversion is presented and the particular solutions to these problems taken within the AT&T Bell Laboratories TTS system are described.
Sequence to Sequence Learning with Neural Networks
- Computer Science
- NIPS
- 2014
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Fully Character-Level Neural Machine Translation without Explicit Segmentation
- Computer Science
- TACL
- 2017
A neural machine translation model that maps a source character sequence to a target character sequence without any segmentation is introduced, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities.
Statistical Parametric Speech Synthesis
- Computer Science
- 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
- 2007