Corpus ID: 239024330

Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

Fengyu Yang, Jian Luan, Yujun Wang
Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint that helps the VAE extract emotion embeddings with better cluster cohesion. In addition, the obtained emotion embedding is used as a query to aggregate latent representations of all…
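As a rough illustration of what using an embedding "as query to aggregate latent representations" can look like, the sketch below applies generic scaled dot-product attention over a small set of candidate vectors. It is not the authors' implementation: the scoring function, the names `aggregate`, `emotion_embedding`, and `latents`, and the toy two-dimensional values are all assumptions, and the abstract does not state here which representations are aggregated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(query, candidates):
    """Pool candidate vectors with attention weights derived from the query.

    Assumption: scaled dot-product scoring; other scoring functions are possible.
    Returns the weighted sum of the candidates and the attention weights.
    """
    dim = len(query)
    scores = [
        sum(q * c for q, c in zip(query, cand)) / math.sqrt(dim)
        for cand in candidates
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(dim)
    ]
    return pooled, weights


# Hypothetical toy inputs: one query vector and three candidate latent vectors.
emotion_embedding = [1.0, 0.0]
latents = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pooled, weights = aggregate(emotion_embedding, latents)
```

In this toy setup the candidate most aligned with the query receives the largest weight, so the pooled vector leans toward it; in a real model the query and candidates would be learned, typically after linear projections.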

