Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Yihan Wu, Xi Wang, Shao Min Zhang, Lei He, Ruihua Song, Jianyun Nie
Expressive speech synthesis, such as audiobook synthesis, remains challenging for style representation learning and prediction. Deriving style from reference audio or predicting style tags from text requires a huge amount of labeled data, which is costly to acquire and difficult to define and annotate accurately. In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning…
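
Because the abstract is truncated, the exact contrastive objective is not shown here. As a generic illustration only, the following minimal sketch computes an InfoNCE-style contrastive loss for one anchor vector. The function names, the toy vectors, and the temperature value are assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair.

    Hypothetical helper; the paper's actual objective may differ.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Toy 2-D "style" vectors (made up for illustration).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(anchor, positive, negatives)
# Swap roles: treat a dissimilar vector as the "positive".
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

The loss is small when the anchor is closer to its positive than to the negatives, and large when the pairing is mismatched, which is the property a contrastive objective relies on.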

Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis

This work introduces the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron, and shows that the system can render text with more pitch and energy variation than two state-of-the-art baseline models.

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

"Global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
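
The idea of a style embedding formed from a bank of token embeddings and a set of combination weights can be sketched as a softmax-weighted sum. This is a simplified illustration under stated assumptions: the published model derives the weights with an attention module over a reference encoding, whereas here the raw scores are supplied directly, and `token_bank`, `style_embedding`, and all values are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def style_embedding(token_bank, scores):
    """Weighted sum of token embeddings (simplified stand-in for attention)."""
    weights = softmax(scores)
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank))
            for d in range(dim)]

# Three toy 3-D tokens (made-up values; real tokens are learned).
token_bank = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

uniform = style_embedding(token_bank, [0.0, 0.0, 0.0])   # equal mix of all tokens
peaked = style_embedding(token_bank, [10.0, 0.0, 0.0])   # dominated by the first token
```

Predicting the weights (or the resulting embedding) from text instead of from reference audio is what lets such weights act as "virtual" style labels in the text-predicting variant listed above.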

Expressive Text-to-Speech Using Style Tag

This work models the relationship between the linguistic embedding and the speaking style domain, which enables the model to work even with style tags unseen during training, and shows that ST-TTS outperforms the existing expressive TTS model, Tacotron2-GST, in speech quality and expressiveness.

Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

The main goal of the method is to preserve the input content in the synthesized speech signal, which is measured by the word error rate (WER), and the results show substantial improvements over state-of-the-art unsupervised speech synthesis methods.

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

  • Guangzhi Sun, Yu Zhang, Yonghui Wu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Experimental results show that the proposed sequential prior in a discrete latent space, which can generate more natural-sounding samples, significantly improves the naturalness of random sample generation, and that random sampling can be used as data augmentation to improve ASR performance.

Neural TTS Stylization with Adversarial and Collaborative Games

This work introduces an end-to-end TTS model that offers enhanced content-style disentanglement ability and controllability, and achieves state-of-the-art results across multiple tasks, including style transfer (content and style swapping), emotion modeling, and identity transfer.

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Experiments show the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

WavAugment, a time-domain data augmentation library, is introduced and is adapted and optimized for the specificities of CPC (raw waveform input, contrastive loss, past versus future structure); the work finds that applying augmentation only to the segments from which the CPC prediction is performed yields better results.

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, results in synthesized audio that matches the prosody of the reference signal with fine time detail.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.