Corpus ID: 237503166

Controllable cross-speaker emotion transfer for end-to-end speech synthesis

@article{Li2021ControllableCE,
  title={Controllable cross-speaker emotion transfer for end-to-end speech synthesis},
  author={Tao Li and Xinsheng Wang and Qicong Xie and Zhichao Wang and Lei Xie},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.06733}
}
  • Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Lei Xie
  • Published 14 September 2021
  • Computer Science, Engineering
  • ArXiv
The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the emotion transfer process, the identity information of the source speaker can also affect the synthesized results, causing speaker leakage, i.e., the synthetic speech may carry the voice identity of the source speaker rather than the target speaker…
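The abstract is truncated before the paper's own solution is described, so the following is only an illustrative sketch of one common way to counter speaker leakage, not necessarily the authors' method: an adversarial speaker classifier behind a gradient-reversal layer pushes the reference (emotion) embedding to carry no speaker identity. All module names and dimensions below are assumptions.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; negates (and scales) gradients in backward.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    class EmotionEncoder(nn.Module):
        """Hypothetical reference encoder: maps reference mels to an emotion
        embedding that an adversarial classifier tries to keep speaker-free."""
        def __init__(self, n_mels=80, emb_dim=128, n_speakers=10, lambd=1.0):
            super().__init__()
            self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)
            self.speaker_clf = nn.Linear(emb_dim, n_speakers)  # adversary
            self.lambd = lambd

        def forward(self, ref_mels):  # ref_mels: (batch, frames, n_mels)
            _, h = self.rnn(ref_mels)
            emo = h.squeeze(0)  # (batch, emb_dim) emotion embedding
            spk_logits = self.speaker_clf(GradReverse.apply(emo, self.lambd))
            return emo, spk_logits

A cross-entropy loss on spk_logits would be added to the synthesis loss; the reversed gradient drives the encoder to discard speaker cues while emo still conditions the decoder.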

Citations

Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech
  • Pengfei Wu, Junjie Pan, +4 authors Zejun Ma
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
Experimental results show that the proposed cross-speaker emotion transfer method outperforms the multi-reference baseline in terms of timbre similarity, stability, and emotion perception evaluations.
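A hedged sketch of the speaker condition layer normalization (SCLN) idea named in this title, in its commonly used form: a LayerNorm whose gain and bias are predicted from the speaker embedding, so the target speaker's timbre is injected at every normalization point. Dimensions and placement are assumptions, not details from the paper.

    import torch
    import torch.nn as nn

    class SCLN(nn.Module):
        """Layer norm whose gain and bias are predicted from the speaker
        embedding, injecting target-speaker timbre at normalization points."""
        def __init__(self, hidden_dim=256, spk_dim=128):  # dims assumed
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            self.affine = nn.Linear(spk_dim, 2 * hidden_dim)  # -> gain, bias

        def forward(self, x, spk_emb):  # x: (B, T, H); spk_emb: (B, S)
            gain, bias = self.affine(spk_emb).chunk(2, dim=-1)
            return gain.unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)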

References

Showing 1-10 of 49 references
Controllable Emotion Transfer For End-to-End Speech Synthesis
TLDR
The synthetic speech of the proposed method is more accurate and expressive, with fewer emotion-category confusions, and its control of emotion strength is more salient to listeners.
Multi-speaker Emotional Acoustic Modeling for CNN-based Speech Synthesis
TLDR
Experimental results demonstrate that the multi-speaker emotional speech synthesis approach, which uses a trainable speaker embedding and an emotion representation extracted from the mel spectrogram, outperforms other approaches in terms of naturalness, speaker similarity, and emotion similarity.
Emotional transplant in statistical speech synthesis based on emotion additive model
TLDR
Experimental results show that the proposed method synthesizes emotional speech with appropriate emotion expression and high speech quality.
An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis
TLDR
This letter proposes an effective way of generating emotion embedding vectors from the trained GSTs and confirms that the proposed controlled-weight method is superior to conventional emotion-label-based methods in terms of perceptual quality and emotion classification accuracy.
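A minimal sketch of the control scheme this TLDR describes: once the GSTs are trained, an emotion embedding is formed as a manually weighted combination of the token vectors rather than from attention weights inferred from a reference. Token count, dimension, and the example weights are assumptions.

    import torch

    n_tokens, token_dim = 10, 256  # assumed; not stated in the summary
    gst_tokens = torch.randn(n_tokens, token_dim)  # stand-in for trained GSTs

    def emotion_embedding(weights: torch.Tensor) -> torch.Tensor:
        """weights: (n_tokens,) non-negative control weights."""
        weights = weights / weights.sum()  # keep a convex combination
        return weights @ gst_tokens        # (token_dim,) style vector

    # Hypothetical example: emphasize tokens that correlate with one emotion.
    w = torch.zeros(n_tokens)
    w[2], w[7] = 0.7, 0.3
    style = emotion_embedding(w)  # fed to the decoder as the style condition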
Improving Latent Representation for End-to-End Multi-Speaker Expressive Text-to-Speech System
TLDR
The obtained results show that adding multi-class N-pair loss based deep metric learning to the training process improves expressivity in the desired speaker's voice.
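This entry and the FLOW metric-learning entry below both rely on the multi-class N-pair loss, so here is a minimal sketch of that loss in its standard softmax form; the batch layout (one anchor/positive pair per class) is an assumption.

    import torch
    import torch.nn.functional as F

    def n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
        """anchors, positives: (N, D) embeddings; row i is class i's pair,
        so every other positive acts as a negative for anchor i."""
        logits = anchors @ positives.t()         # (N, N) similarity matrix
        targets = torch.arange(anchors.size(0))  # diagonal = positive pairs
        return F.cross_entropy(logits, targets)  # softmax form of N-pair loss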
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
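A minimal sketch of the transfer-learning recipe this TLDR describes: a speaker encoder pretrained on verification yields a fixed, L2-normalized d-vector that conditions the synthesizer, and random unit vectors stand in for novel speakers. The encoder internals below are placeholders, not the paper's exact network.

    import torch
    import torch.nn as nn

    class SpeakerEncoder(nn.Module):
        """Placeholder verification encoder; the real one is pretrained
        on a speaker-verification task and then frozen."""
        def __init__(self, n_mels=40, emb_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mels, emb_dim, num_layers=3, batch_first=True)

        @torch.no_grad()
        def embed(self, mels):  # mels: (batch, frames, n_mels)
            _, (h, _) = self.lstm(mels)
            d = h[-1]  # final hidden state of the top layer
            return d / d.norm(dim=-1, keepdim=True)  # L2-normalized d-vector

    enc = SpeakerEncoder().eval()
    d_vec = enc.embed(torch.randn(1, 120, 40))  # conditions the TTS decoder

    # Per the TLDR, random points on the unit sphere yield novel voices:
    novel = torch.randn(1, 256)
    novel = novel / novel.norm(dim=-1, keepdim=True)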
Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-Speech Synthesis
TLDR
The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning, along with the IAF model, improves the transfer of expressivity in the desired speaker's voice in synthesized speech.
End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training
TLDR
An end-to-end emotional speech synthesis (ESS) method that adopts global style tokens (GSTs) for semi-supervised training within the GST-Tacotron framework; it outperforms the conventional Tacotron model even when only 5% of the training data has emotion labels.
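A rough sketch of the semi-supervised idea this TLDR describes, under the assumption that the style-token attention scores are supervised with the emotion label on the small labeled subset while unlabeled utterances contribute only the reconstruction loss; the exact loss pairing is not given in the summary.

    import torch
    import torch.nn.functional as F

    def gst_loss(recon_loss, token_logits, emotion_label=None):
        """token_logits: (n_tokens,) pre-softmax style-token attention scores;
        ties one token to each emotion class (an assumption)."""
        loss = recon_loss
        if emotion_label is not None:  # labeled subset (~5% in the paper)
            loss = loss + F.cross_entropy(token_logits.unsqueeze(0),
                                          torch.tensor([emotion_label]))
        return loss

    # e.g. unlabeled: gst_loss(recon, logits); labeled: gst_loss(recon, logits, 2)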
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
TLDR
An extension to the Tacotron speech synthesis architecture learns a latent embedding space of prosody derived from a reference acoustic representation containing the desired prosody; this results in synthesized audio that matches the prosody of the reference signal in fine time detail.
Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis
TLDR
This paper focuses on subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotion vector and a simple continuous scalar, respectively.
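A minimal sketch of the control interface this TLDR describes: a discrete emotion vector picks the category and a continuous scalar sets the strength (learned via relative attributes in the paper, treated as a given input here). The category set and how the model consumes the vector are assumptions.

    import torch

    EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed category set

    def emotion_condition(category: str, strength: float) -> torch.Tensor:
        """One-hot emotion category concatenated with a strength scalar in [0, 1]."""
        one_hot = torch.zeros(len(EMOTIONS))
        one_hot[EMOTIONS.index(category)] = 1.0
        return torch.cat([one_hot, torch.tensor([strength])])  # (n_emotions + 1,)

    cond = emotion_condition("happy", 0.8)  # vary the scalar for mild vs. strong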