• Corpus ID: 239024330

Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

Fengyu Yang, Jian Luan, Yujun Wang
Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help the VAE extract emotion embeddings with better cluster cohesion. In addition, the obtained emotion embedding is used as a query to aggregate latent representations of all… 
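
The abstract mentions "cluster cohesion" without saying how it is measured. As an illustration only, one common way to quantify how well emotion embeddings cluster is the ratio of between-class to within-class distance. The sketch below is an assumption and not the constraint proposed in the paper: the function name `inter_to_intra_ratio`, the centroid-based distances, and the toy 2-D vectors are all hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def inter_to_intra_ratio(embeddings, labels):
    """Mean distance between class centroids divided by the mean
    distance of each embedding to its own class centroid.
    Higher values indicate tighter, better separated clusters.
    Hypothetical helper for illustration, not the paper's method."""
    groups = {}
    for vec, lab in zip(embeddings, labels):
        groups.setdefault(lab, []).append(vec)
    if len(groups) < 2:
        raise ValueError("need at least two emotion classes")

    centroids = {lab: centroid(vecs) for lab, vecs in groups.items()}

    # Within-class spread: per-class mean distance to the centroid,
    # then averaged over classes.
    intra = sum(
        sum(math.dist(v, centroids[lab]) for v in vecs) / len(vecs)
        for lab, vecs in groups.items()
    ) / len(groups)

    # Between-class separation: mean distance over all centroid pairs.
    pairs = list(combinations(centroids.values(), 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)

    # Raises ZeroDivisionError if every class collapses to one point.
    return inter / intra


# Toy 2-D "embeddings" for two emotion classes (made-up data).
labels = ["happy", "happy", "sad", "sad"]
tight = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
loose = [[0.0, 0.0], [0.0, 4.0], [10.0, 0.0], [10.0, 4.0]]
print(inter_to_intra_ratio(tight, labels))
print(inter_to_intra_ratio(loose, labels))
```

The tighter configuration yields the larger ratio. A metric of this kind could be used to compare embeddings from different models, but the evaluation actually used in the paper is not stated in the truncated abstract.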

An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis
This letter proposes an effective way of generating emotion embedding vectors by utilizing the trained GSTs, and confirms that the proposed controlled weight-based method is superior to the conventional emotion label-based methods in terms of perceptual quality and emotion classification accuracy.
End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training
An end-to-end emotional speech synthesis (ESS) method that adopts global style tokens (GSTs) for semi-supervised training based on the GST-Tacotron framework; it outperforms the conventional Tacotron model when only 5% of the training data has emotion labels.
Emotional Speech Synthesis with Rich and Granularized Control
An inter-to-intra emotional distance ratio algorithm is applied to the embedding vectors, minimizing their distance to the target emotion category while maximizing their distance to the other emotion categories.
Multi-Speaker Emotional Speech Synthesis with Fine-Grained Prosody Modeling
  • Chunhui Lu, Xue Wen, Ruolan Liu, Xiao Chen
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This work presents an end-to-end system that learns emotion classes from just two speakers then generalizes these classes to other speakers from whom no emotional data was seen, achieving higher ratings in naturalness and expressiveness, while retaining comparable speaker similarity ratings.
Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis
By comparing DNN-based speech synthesizers that utilize different emotional representations, this paper assesses the impact of these representations and design decisions on human emotion recognition rates, perceived emotional strength, and subjective speech quality.
Emotional End-to-End Neural Speech Synthesizer
An emotional speech synthesizer based on Tacotron, a recent end-to-end neural model, is introduced. Because Tacotron suffers from the exposure bias problem and irregular attention alignment, the paper also proposes utilizing a context vector and residual connections in the recurrent neural networks (RNNs).
Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
This paper proposes a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS, and investigates two methods of context aggregation.
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis
The Variational Autoencoder (VAE) is introduced to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner and shows good properties such as disentangling, scaling, and combination.
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention
This paper proposes VARA-TTS, a non-autoregressive (non-AR) end-to-end text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer and is more robust than using only a single attention layer.
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Experiments show the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.