Corpus ID: 238583666

PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control

@article{He2021PAMATTSPM,
  title={PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control},
  author={Yunchao He and Jian Luan and Yujun Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.04486}
}
Sequence expansion between encoder and decoder is a critical challenge in sequence-to-sequence TTS. Attention-based methods achieve great naturalness but suffer from instability issues such as missing and repeated phonemes, and offer no accurate duration control. Duration-informed methods, in contrast, adjust phoneme duration easily but show obvious degradation in speech naturalness. This paper proposes PAMA-TTS to address the problem. It takes advantage of both flexible attention…
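The duration-informed expansion that the abstract contrasts with attention can be sketched as a FastSpeech-style length regulator: each phoneme's encoder state is repeated for its predicted number of frames. A simplified illustration, not the paper's implementation; the function name is hypothetical:

```python
import numpy as np

def length_regulate(encoder_states, durations):
    """Duration-informed sequence expansion: repeat each phoneme's
    encoder state `durations[i]` times along the time axis, so the
    expanded sequence matches the target frame count."""
    return np.repeat(encoder_states, durations, axis=0)

# Three phoneme states of dimension 2, expanded to 2 + 1 + 3 = 6 frames.
states = np.arange(6, dtype=float).reshape(3, 2)
frames = length_regulate(states, np.array([2, 1, 3]))
```

This hard assignment is what makes duration control exact but removes the soft alignment that attention-based models exploit for naturalness.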


References

Showing 1-10 of 21 references
Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems
This paper investigates Tacotron-based acoustic models with phoneme alignment instead of attention, and shows that the proposed model can realize a high-fidelity TTS system for Japanese with a real-time factor of 0.13 on a GPU, without the attention prediction errors seen in seq2seq models.
Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS
The experimental results show that the proposed stepwise monotonic attention method could achieve significant improvements in robustness on out-of-domain scenarios for phoneme-based models, without any regression on the in-domain naturalness test.
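The stepwise monotonic attention described above constrains the alignment so that, at each decoder step, the attention mass either stays on the current encoder position or advances by exactly one. A minimal NumPy sketch of that recursion, under that assumption; not the paper's code:

```python
import numpy as np

def stepwise_monotonic_step(prev_alpha, move_prob):
    """One decoder step of stepwise monotonic attention.

    prev_alpha: previous attention distribution over encoder positions.
    move_prob:  per-position probability of advancing to the next position.
    Mass at position j either stays (1 - move_prob[j]) or shifts to j+1,
    so the alignment can never skip or go backwards.
    """
    stay = prev_alpha * (1.0 - move_prob)
    move = np.zeros_like(prev_alpha)
    move[1:] = prev_alpha[:-1] * move_prob[:-1]
    return stay + move

# Starting fully aligned to the first phoneme, mass can only stay or
# advance one step, so the total probability is preserved.
alpha = stepwise_monotonic_step(np.array([1.0, 0.0, 0.0]),
                                np.array([0.3, 0.5, 0.2]))
```

Because mass only moves forward one position at a time, missing and repeated phonemes are ruled out by construction, which is the robustness property the summary refers to.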
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
It is concluded that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis
A novel pre-alignment guided attention learning approach is proposed that injects prior knowledge (accurate phoneme durations) into the neural network loss function to bias attention learning in the desired direction more accurately.
VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis
Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality while its synthesis speed is comparable with other NAR-TTS models; the proposed model is an end-to-end approach that does not require phoneme-level durations.
Forward Attention in Sequence- To-Sequence Acoustic Modeling for Speech Synthesis
Experimental results show that the proposed forward attention method achieves faster convergence and higher stability than the baseline attention method, and can also help improve the naturalness of synthetic speech and control its speed effectively.
Aligntts: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment
AlignTTS is a Feed-Forward Transformer-based model that generates a mel-spectrogram from a sequence of characters, with the duration of each character determined by a duration predictor; it runs more than 50 times faster than real-time.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel Transformer-based feed-forward network, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism in Tacotron2, achieving state-of-the-art performance and close-to-human quality.