ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Yi Ren, Ming Lei, Zhiying Huang, Shi-Rui Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Expressive text-to-speech (TTS) has recently become an active research topic, with most work focusing on modeling prosody in speech. Prosody modeling faces several challenges: 1) the extracted pitch used in previous prosody modeling work contains inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration, and energy) depend on each other and jointly produce natural prosody; and 3) due to the high variability of prosody and the limited amount of high…

