Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

  • Shifeng Pan, Lei He
  • Published 2021
  • Computer Science, Engineering
  • ArXiv
Cross-speaker style transfer is crucial for multi-style and expressive speech synthesis at scale: it does not require the target speakers to be experts in expressing all styles, or corresponding recordings to be collected for model training. However, the performance of existing style transfer methods still falls far short of real application needs. The root causes are mainly twofold. Firstly, the style embedding extracted from a single reference speech can hardly provide fine-grained…


Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis
  • Songxiang Liu, Shan Yang, Dan Su, Dong Yu
  • Computer Science, Engineering
  • ArXiv
  • 2021
Experimental results show that Referee outperforms a global-style-token (GST)-based baseline approach in cross-speaker style transfer (CSST).


Fine-grained robust prosody transfer for single-speaker neural text-to-speech
This work proposes decoupling the reference signal alignment from the overall system, and incorporates a variational auto-encoder to further enhance the latent representation of prosody embeddings in a neural text-to-speech system.
Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
Subjective crowd evaluations confirm that the synthesized speech convincingly conveys the desired expressive styles while preserving a high level of quality.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Introduces "global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system; the tokens learn to factorize noise and speaker identity, providing a path towards highly scalable yet robust speech synthesis.
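As a rough illustration of the GST mechanism (a sketch, not the paper's implementation): a style embedding is formed by attending over a small learnable bank of token embeddings, with a reference-encoder output acting as the query. The token count, dimension, and single-head attention below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: a small bank of 10 style tokens of dimension 256.
num_tokens, d_model = 10, 256

# Learnable bank of style token embeddings (randomly initialized here;
# in training these would be optimized jointly with the synthesizer).
token_bank = rng.standard_normal((num_tokens, d_model))

def gst_style_embedding(reference_encoding: np.ndarray) -> np.ndarray:
    """Attend over the token bank using the reference encoding as the query.

    The style embedding is a convex combination of the tokens, so at
    inference one can also condition on hand-picked token weights
    instead of a reference utterance.
    """
    scores = token_bank @ reference_encoding / np.sqrt(d_model)  # (num_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                     # softmax attention
    return weights @ token_bank                                  # (d_model,)

ref = rng.standard_normal(d_model)   # stand-in for a reference encoder output
style = gst_style_embedding(ref)
print(style.shape)  # (256,)
```

Because the output is a weighted sum of a fixed bank, the tokens act as an interpretable basis for style: scaling or selecting individual token weights gives direct control over the synthesized speaking style.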
Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis
  • Younggun Lee, Taesu Kim
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The proposed methods introduce temporal structure into the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and apply temporal normalization of prosody embeddings, which improves robustness against speaker perturbations during prosody transfer.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody; this results in synthesized audio that matches the prosody of the reference signal in fine time detail.
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis
The Variational Autoencoder (VAE) is introduced into an end-to-end speech synthesis model to learn latent representations of speaking style in an unsupervised manner; the learned latents show good properties such as disentangling, scaling, and combination.
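A minimal sketch of the reparameterization step such a VAE relies on, and of the "scaling" property the abstract mentions (the 16-dimensional latent and NumPy stand-ins are assumptions for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Sample z = mu + sigma * eps, keeping the draw differentiable w.r.t. mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical 16-dim style latent predicted by a reference encoder.
mu = np.zeros(16)
log_var = np.zeros(16)           # unit variance
z = reparameterize(mu, log_var)

# "Scaling" style control: rescaling the latent before decoding
# strengthens or weakens the style it encodes.
z_stronger = 2.0 * z
print(z.shape)  # (16,)
```

Because the latent is sampled from a smooth Gaussian posterior, interpolating or combining latents from different utterances yields the disentangling and combination behavior the paper reports.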
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters; it achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
This paper proposes an adversarial cycle consistency training scheme with paired and unpaired triplets to ensure the use of information from all style dimensions and uses this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.
Neural TTS Stylization with Adversarial and Collaborative Games
This work introduces an end-to-end TTS model that offers enhanced content-style disentanglement and controllability, and achieves state-of-the-art results across multiple tasks, including style transfer (content and style swapping), emotion modeling, and identity transfer.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron 2, achieving state-of-the-art performance and quality close to human speech.