Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned on the polyphone… 

