Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

  title={Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling},
  author={Isaac Elias and Heiga Zen and Jonathan Shen and Yu Zhang and Ye Jia and R. J. Skerry-Ryan and Yonghui Wu},
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. Based on a novel attention mechanism and an iterative reconstruction loss using Soft Dynamic Time Warping, the model learns token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective…
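The Soft Dynamic Time Warping loss mentioned above replaces the hard minimum in the classic DTW recurrence with a differentiable soft minimum, so alignment cost can be backpropagated. This is not code from the paper, just a minimal NumPy sketch of the standard Soft-DTW recurrence over a precomputed pairwise distance matrix:

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * logsumexp(-v / gamma)."""
    v = -np.asarray(values) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(D, gamma=0.1):
    """Soft-DTW alignment cost for an n x m pairwise distance matrix D.

    R[i, j] accumulates the soft-minimal cost of aligning the first i
    tokens to the first j frames; the three predecessors correspond to
    the usual DTW moves (insert, delete, match).
    """
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]
```

As `gamma` approaches 0, `soft_dtw` recovers the ordinary (non-differentiable) DTW cost; larger `gamma` gives a smoother loss surface at the price of a looser bound.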


Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling
This work proposes a novel approach that models phone-level prosodies with a GMM-based mixture density network, extends it to multi-speaker TTS using speaker-adaptation transforms of the Gaussian means and variances, and shows that it can clone the prosody of a reference speech by sampling prosodies from the Gaussian components that produce the reference prosodies.
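The sampling step described above (draw a mixture component, then draw from that component's Gaussian) is the generic ancestral-sampling procedure for a GMM. A hypothetical sketch, not the authors' code; the function name and shapes are illustrative:

```python
import numpy as np

def sample_gmm(weights, means, stds, rng):
    """Sample one phone-level prosody vector from a diagonal GMM.

    First pick a component index by its mixture weight, then draw
    from that component's Gaussian; returns the sample and the
    chosen component index.
    """
    k = rng.choice(len(weights), p=weights)
    sample = means[k] + stds[k] * rng.standard_normal(means[k].shape)
    return sample, k
```

Sampling different components for each phone is what yields diverse prosody; fixing the component indices to those that best explain a reference utterance is the prosody-cloning mode the abstract describes.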
A Survey on Neural Speech Synthesis
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, focusing on the key components of neural TTS: text analysis, acoustic models, and vocoders.
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text and generates speech that matches the video signal, demonstrating video-speech synchronization, robustness to speaker ID swapping, and natural prosody.
Neural HMMs are all you need (for high-quality attention-free TTS)
This paper replaces the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network, yielding an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations.
PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control
  • Yunchao He, Jian Luan, Yujun Wang
  • Computer Science
  • ArXiv
  • 2021
Experimental results show that PAMA-TTS achieves the highest naturalness while offering on-par or even better duration controllability than the duration-informed model.
Phone-Level Prosody Modelling with GMM-Based MDN for Diverse and Controllable Speech Synthesis
  • Chenpeng Du, Kai Yu
  • 2021
Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling with a phone-level prosody distribution has been investigated to generate different…
Translatotron 2: Robust direct speech-to-speech translation
Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation such as babbling or long pauses.


Parallel Tacotron: Non-Autoregressive and Controllable TTS
  • Isaac Elias, H. Zen, +4 authors Yonghui Wu
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. Expand
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow
Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and even exceeds that of the autoregressive model Tacotron 2.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters; it achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it can be efficiently trained on data with tens of thousands of samples per second of audio, and can also be employed as a discriminative model, returning promising results for phoneme recognition.
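WaveNet handles those tens of thousands of samples per second with stacks of causal convolutions whose dilation doubles at each layer, so the receptive field grows exponentially with depth. A small illustrative helper (not from the paper) showing that growth for the standard dilation schedule 1, 2, 4, …:

```python
def receptive_field(layers, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal
    convolutions with dilations 1, 2, 4, ..., 2**(layers-1).

    Each layer with dilation d and kernel size k extends the
    receptive field by (k - 1) * d samples.
    """
    return sum((kernel_size - 1) * 2 ** i for i in range(layers)) + 1
```

With kernel size 2, a 10-layer stack already covers 1024 samples, which is why dilated convolutions are practical for raw audio where ordinary convolutions would need prohibitive depth.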
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
This work proposes a non-autoregressive architecture called EfficientTTS, which optimizes all its parameters with a stable, end-to-end training procedure while synthesizing high-quality speech quickly and efficiently.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel Transformer-based feed-forward network that generates mel-spectrograms in parallel for TTS is proposed; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.
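FastSpeech's parallelism rests on a length regulator: each phoneme's hidden vector is repeated according to a predicted integer duration, turning a phoneme-rate sequence into a frame-rate one in a single step. A minimal NumPy sketch of that idea (illustrative, not the authors' implementation):

```python
import numpy as np

def length_regulate(hidden, durations):
    """Expand phoneme-level hidden states to frame level by repeating
    each row `durations[i]` times (FastSpeech-style length regulator)."""
    return np.repeat(hidden, durations, axis=0)
```

Because every output frame is produced from the expanded sequence at once, rather than one frame at a time as in autoregressive models, the whole mel-spectrogram can be generated in parallel, which is the source of the reported speedups.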
Char2Wav: End-to-End Speech Synthesis
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features. Expand
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous…