Corpus ID: 162169005

FastSpeech: Fast, Robust and Controllable Text to Speech

@inproceedings{Ren2019FastSpeechFR,
  title={FastSpeech: Fast, Robust and Controllable Text to Speech},
  author={Yi Ren and Yangjun Ruan and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
  booktitle={NeurIPS},
  year={2019}
}
Neural network-based end-to-end text-to-speech (TTS) has significantly improved the quality of synthesized speech. [...] Key Method: Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction; the predicted durations are used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive [...]
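The length regulator described in the abstract is simple to sketch: each phoneme's hidden state is repeated according to its duration so that the expanded sequence matches the mel-spectrogram length. A minimal NumPy illustration (function and variable names are ours, not from the paper's code):

import numpy as np

def length_regulate(hidden, durations):
    """Expand phoneme-level hidden states to frame level.

    hidden:    (num_phonemes, hidden_dim) encoder outputs
    durations: (num_phonemes,) integer frames per phoneme
    returns:   (sum(durations), hidden_dim) frame-level sequence
    """
    # Repeat each phoneme's hidden state durations[i] times, so the
    # output length matches the target mel-spectrogram.
    return np.repeat(hidden, durations, axis=0)

# Example: 3 phonemes with durations 2, 3, 1 -> 6 mel frames.
h = np.random.randn(3, 4)
d = np.array([2, 3, 1])
assert length_regulate(h, d).shape == (6, 4)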
Reformer-TTS: Neural Speech Synthesis with Reformer Network
TLDR
This work proposes Reformer-TTS, a model built on the Reformer network, which uses locality-sensitive hashing attention and reversible residual networks to achieve fast convergence when training an end-to-end TTS system.
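The reversible residual blocks mentioned here let activations be recomputed from outputs instead of stored, which is what saves memory. A generic sketch of the RevNet-style coupling that Reformer uses (F and G stand in for arbitrary sublayers such as attention and feed-forward; this shows the general pattern, not Reformer-TTS's actual code):

import numpy as np

def rev_block_forward(x1, x2, F, G):
    """Reversible residual coupling: the outputs determine the
    inputs exactly, so activations need not be stored."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_block_inverse(y1, y2, F, G):
    """Recover the inputs by running the coupling backwards."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Arbitrary sublayers standing in for attention (F) and feed-forward (G).
F = np.tanh
G = lambda x: 0.5 * x
x1, x2 = np.random.randn(2, 8), np.random.randn(2, 8)
y1, y2 = rev_block_forward(x1, x2, F, G)
r1, r2 = rev_block_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)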
FPETS: Fully Parallel End-to-End Text-to-Speech System
TLDR
Experimental results show that FPETS exploits the power of parallel computation and achieves a significant inference speed-up compared with state-of-the-art end-to-end TTS systems.
Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data
  • Zhu Li, Yuqing Zhang, +4 authors Caixia Gong
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 model can improve prosody, especially for structurally complex sentences.
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
  • Yi Ren, Jinglin Liu, Zhou Zhao
  • Engineering, Computer Science
  • ArXiv
  • 2021
TLDR
PortaSpeech is proposed, a portable and high-quality generative text-to-speech model that outperforms other TTS models in both voice quality and prosody modeling on subjective and objective evaluation metrics, and shows only a slight performance degradation when its parameters are reduced to 6.7M.
Guided-TTS: Text-to-Speech with Untranscribed Speech
  • Heeseung Kim, Sungwon Kim, Sungroh Yoon
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
Guided-TTS is presented, a high-quality TTS model that learns to generate speech from untranscribed speech data and achieves performance comparable to existing methods on LJSpeech without using any transcript.
LinearSpeech: Parallel Text-to-Speech with Linear Complexity
TLDR
This work proposes LinearSpeech, an efficient parallel text-to-speech model with O(N) memory and computational complexity, and adds a novel positional encoding to standard and linear attention modules, enabling the model to learn the order of the input sequence and synthesize long mel-spectrograms.
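The O(N) complexity typically comes from kernelized (linear) attention, which avoids materializing the N x N softmax matrix. A generic sketch using the common elu(x)+1 feature map; LinearSpeech's exact attention and positional encoding may differ:

import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1:
    # x + 1 for x > 0, exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Kernel-based attention in O(N) time and memory.

    Instead of softmax(Q K^T) V, which is O(N^2), compute
    phi(Q) @ (phi(K)^T V), associating the product the cheap way.
    Q, K: (N, d); V: (N, d_v).
    """
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                 # (d, d_v), costs O(N * d * d_v)
    z = Qp @ Kp.sum(axis=0)       # (N,), per-query normalizer
    return (Qp @ kv) / z[:, None]

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
assert linear_attention(Q, K, V).shape == (N, d)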
AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
  • Yuzi Yan, Xu Tan, +6 authors Tie-Yan Liu
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
AdaSpeech 3 is developed, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech; a spontaneous speech dataset is also mined to support this work and to facilitate future research on spontaneous TTS.
DurIAN: Duration Informed Attention Network for Speech Synthesis
TLDR
It is shown that the proposed DurIAN system can generate highly natural speech on par with current state-of-the-art end-to-end systems, while being robust and stable at the same time.
VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis
  • Hui-Ling Lu, Zhiyong Wu, +4 authors H. Meng
  • Computer Science, Engineering
  • Interspeech 2021
  • 2021
TLDR
Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality while its synthesis speed is comparable with other NAR-TTS models; the proposed model is an end-to-end approach that does not require phoneme-level durations.
Transformer-Based Text-to-Speech with Weighted Forced Attention
TLDR
The results of experiments indicate that the proposed Transformer using forced attention with a weighting factor of 0.5 outperforms the other models, and that removing the duration predictor from FastSpeech improves synthesis quality, although the proposed weighted forced attention does not improve synthesis stability.
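Weighted forced attention can be read as blending a hard, duration-derived alignment with the model's learned soft attention, with the weighting factor trading stability against quality. A minimal sketch under that reading (the blending form is our illustration; only the 0.5 factor comes from the summary above):

import numpy as np

def weighted_forced_attention(soft_attn, durations, w=0.5):
    """Blend learned soft attention with a duration-derived hard
    alignment: w * forced + (1 - w) * soft.

    soft_attn: (num_frames, num_phonemes) attention matrix
    durations: (num_phonemes,) frames assigned to each phoneme
    """
    num_frames, num_phonemes = soft_attn.shape
    assert durations.sum() == num_frames
    forced = np.zeros_like(soft_attn)
    frame = 0
    for p, dur in enumerate(durations):
        forced[frame:frame + dur, p] = 1.0   # hard monotonic alignment
        frame += dur
    return w * forced + (1.0 - w) * soft_attn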

References

Showing 1-10 of 32 references
Neural Speech Synthesis with Transformer Network
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron 2, achieving state-of-the-art performance and quality close to that of human speech.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Parallel Neural Text-to-Speech
TLDR
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional [...]
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps [...]
Almost Unsupervised Text to Speech and Automatic Speech Recognition
TLDR
An almost unsupervised learning method that leverages only a few hundred paired samples plus extra unpaired data for TTS and ASR, achieving a 99.84% word-level intelligibility rate and a 2.68 MOS on the LJSpeech dataset.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Unit selection in a concatenative speech synthesis system using a large speech database
  • Andrew J. Hunt, A. Black
  • Computer Science
  • 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
  • 1996
TLDR
It is proposed that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units.
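That state-transition view maps directly onto a Viterbi search: choose one unit per target position minimizing accumulated target (occupancy) cost plus concatenation (transition) cost. A minimal dynamic-programming sketch over precomputed cost matrices (the cost values themselves are placeholders, not Hunt and Black's actual feature distances):

import numpy as np

def select_units(target_costs, join_costs):
    """Viterbi search over candidate units.

    target_costs: (T, U) cost of unit u at target position t
                  (state occupancy cost).
    join_costs:   (U, U) cost of concatenating unit u after unit v
                  (transition cost).
    returns: list of T unit indices minimizing the total cost.
    """
    T, U = target_costs.shape
    best = target_costs[0].copy()        # best cost ending in unit u at t=0
    back = np.zeros((T, U), dtype=int)
    for t in range(1, T):
        # For each next unit u, pick the previous unit v minimizing
        # best[v] + join_costs[v, u].
        trans = best[:, None] + join_costs   # (U, U)
        back[t] = trans.argmin(axis=0)
        best = trans.min(axis=0) + target_costs[t]
    # Backtrack from the cheapest final unit.
    path = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Example: 5 target positions, 8 candidate units.
tc, jc = np.random.rand(5, 8), np.random.rand(8, 8)
assert len(select_units(tc, jc)) == 5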