Corpus ID: 4425995

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

@article{SkerryRyan2018TowardsEP,
  title={Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron},
  author={R. J. Skerry-Ryan and Eric Battenberg and Ying Xiao and Yuxuan Wang and Daisy Stanton and Joel Shor and Ron J. Weiss and Robert A. J. Clark and Rif A. Saurous},
  journal={ArXiv},
  year={2018},
  volume={abs/1803.09047}
}
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used… 
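To make the architecture described in the abstract concrete, here is a minimal PyTorch-style sketch of a reference encoder that compresses a reference mel spectrogram into a fixed-size prosody embedding, together with the broadcast-and-concatenate step that conditions the text encoding fed to Tacotron's attention-based decoder. The class and function names, layer counts, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ReferenceEncoder(nn.Module):
    """Compress a reference mel spectrogram into a fixed-size prosody embedding.

    Sketch of the idea in the abstract; layer counts and sizes are illustrative.
    """

    def __init__(self, n_mels=80, embedding_dim=128):
        super().__init__()
        # 2-D convolutions over (time, mel) followed by a GRU summarizer.
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        reduced_mels = n_mels // 8  # three stride-2 convs shrink the mel axis 8x
        self.gru = nn.GRU(128 * reduced_mels, embedding_dim, batch_first=True)

    def forward(self, ref_mels):                      # (batch, time, n_mels)
        x = self.convs(ref_mels.unsqueeze(1))         # (batch, 128, time', mels')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, state = self.gru(x)                        # final state summarizes prosody
        return torch.tanh(state.squeeze(0))           # (batch, embedding_dim)


def condition_text_encoding(text_encoding, prosody_embedding):
    """Broadcast the prosody embedding over time and concatenate it with the
    text-encoder outputs, which then feed the attention-based decoder."""
    expanded = prosody_embedding.unsqueeze(1).expand(-1, text_encoding.size(1), -1)
    return torch.cat([text_encoding, expanded], dim=-1)
```

Because the embedding is computed from a reference signal rather than from labels, the same machinery supports transferring prosody from a reference spoken by a different speaker, as the abstract describes.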
Citations

Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis
  • Younggun Lee, Taesu Kim
  • Computer Science, Engineering
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.
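The temporal normalization mentioned above can be pictured as a per-utterance standardization of a prosody-embedding sequence along the time axis. The sketch below, in the same PyTorch style, only illustrates that general idea under assumed tensor shapes; the cited paper's exact formulation may differ.

```python
import torch


def temporal_normalize(prosody_seq, eps=1e-5):
    """Normalize a sequence of prosody embeddings along the time axis.

    prosody_seq: (batch, time, dim) frame- or phone-level prosody embeddings.
    Removing the per-utterance mean and scale discards global, speaker-dependent
    offsets while preserving the temporal shape of the prosody contour.
    """
    mean = prosody_seq.mean(dim=1, keepdim=True)
    std = prosody_seq.std(dim=1, keepdim=True)
    return (prosody_seq - mean) / (std + eps)
```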
Fine-grained robust prosody transfer for single-speaker neural text-to-speech
TLDR
This work proposes decoupling the reference signal alignment from the overall system and incorporates a variational auto-encoder to further enhance the latent representation of prosody embeddings in a neural text-to-speech system.
Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
TLDR
Experimental results show that Daft-Exprt significantly outperforms strong baselines on prosody transfer tasks while yielding naturalness comparable to state-of-the-art expressive models, and indicate that adversarial training effectively discards speaker identity information from the prosody representation, which ensures Daft-Exprt consistently generates speech with the desired voice.
Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis
TLDR
The obtained results show that adding multi-class N-pair-loss-based deep metric learning to the training process improves expressivity in the desired speaker's voice.
Improving Prosody Modelling with Cross-Utterance Bert Embeddings for End-to-End Speech Synthesis
TLDR
Cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pretrained BERT model, are used to augment the input of the Tacotron2 decoder.
ADEPT: A Dataset for Evaluating Prosody Transfer
Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample.
Controllable neural text-to-speech synthesis using intuitive prosodic features
TLDR
This work trains a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions and shows that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
Speech Bert Embedding for Improving Prosody in Neural TTS
TLDR
Experimental results obtained with Transformer TTS show that the proposed speech BERT embedding can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody in improving the final prosody of the TTS speech.
IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion
TLDR
IQDubbing is presented to solve the problem of expressive voice conversion by leveraging the recent advances in discrete self-supervised speech representation (DSSR) to model prosody, and two kinds of prosody filters are proposed to sample prosody from the prosody vector.
Improving Unsupervised Style Transfer in end-to-end Speech Synthesis with end-to-end Speech Recognition
TLDR
This paper proposes to mitigate the problem by using unmatched text and speech during training and by using the accuracy of an end-to-end ASR model to guide the training procedure. Results show that, with the guidance of end-to-end ASR, both the ASR accuracy and the listener preference of the speech generated by the TTS model are improved.

References

Showing 1-10 of 40 references
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, is presented; it achieves a 3.82 mean opinion score on a subjective 5-point scale for US English, outperforming a production parametric system in terms of naturalness.
Uncovering Latent Style Factors for Expressive Speech Synthesis
TLDR
This preliminary study introduces the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model, and shows that without annotation data or an explicit supervision signal, this approach can automatically learn a variety of prosodic variations in a purely data-driven way.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
TLDR
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Experiments with signal-driven symbolic prosody for statistical parametric speech synthesis
TLDR
Objective evaluation performed on a test set of the corpora shows that the proposed systems improve the prediction accuracy of phoneme durations and F0 trajectories.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation
TLDR
This work adopts the i-vector method with probabilistic linear discriminant analysis (PLDA) for voice conversion, which requires neither parallel utterances, transcriptions, nor time-alignment procedures at any stage.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Unsupervised clustering of emotion and voice styles for expressive TTS
  • F. Eyben, S. Buchholz, +4 authors K. Knill
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
TLDR
Initial investigations into improving expressiveness for statistical speech synthesis systems are described, and it is shown that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system.
On Using Backpropagation for Speech Texture Generation and Voice Conversion
TLDR
A proof-of-concept system for speech texture synthesis and voice conversion is presented, based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances.
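The activation-statistics matching named in this summary can be illustrated with a simple loss that compares first- and second-order statistics of network activations between a source and a target utterance. This is a generic stand-in under assumed shapes, not the cited paper's exact objective.

```python
import torch


def activation_statistics_loss(source_acts, target_acts):
    """Match mean and Gram-matrix statistics of neuron activations.

    source_acts, target_acts: (time, channels) activations taken from a speech
    recognition network for the source and target utterances; time lengths may
    differ because only summary statistics are compared.
    """
    mean_loss = (source_acts.mean(dim=0) - target_acts.mean(dim=0)).pow(2).mean()

    def gram(a):
        return (a.t() @ a) / a.size(0)            # (channels, channels)

    gram_loss = (gram(source_acts) - gram(target_acts)).pow(2).mean()
    return mean_loss + gram_loss
```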
Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine
TLDR
This paper presents a voice conversion method that does not use any parallel data while training the model and produces results similar, under both subjective and objective criteria, to those of the popular conventional Gaussian-mixture-model-based method that uses parallel data.