Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. Because every additional step increases complexity and reduces efficiency, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to…

Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

Regotron, a regularized version of Tacotron2, is introduced. It aims to alleviate training issues while producing monotonic alignments, reducing common TTS mistakes and achieving slightly improved speech naturalness according to subjective mean opinion scores (MOS).



End-to-End Adversarial Text-to-Speech

This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Glow-TTS is proposed, a flow-based generative model for parallel TTS that does not require any external aligner and obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality.

FastSpeech: Fast, Robust and Controllable Text to Speech

FastSpeech, a novel feed-forward network based on Transformer that generates mel-spectrograms in parallel for TTS, is proposed. It speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. It achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

Neural Speech Synthesis with Transformer Network

This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance that is close to human quality.

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

This work proposes a non-autoregressive architecture called EfficientTTS, which optimizes all its parameters with a stable, end-to-end training procedure, while allowing for synthesizing high quality speech in a fast and efficient manner.

JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment

We propose Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor jointly trained without explicit alignments in order to generate an acoustic

Deep Voice: Real-time Neural Text-to-Speech

Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The work shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.

Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis

A sequence-to-sequence neural network which directly generates speech waveforms from text inputs, extending the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop, enabling parallel training and synthesis.

VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

This paper proposes VARA-TTS, a non-autoregressive (non-AR) end-to-end text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer and is more robust than using only a single attention layer.