Controllable neural text-to-speech synthesis using intuitive prosodic features

  • T. Raitio, Ramya Rasipuram, Dan Castellani
  • Published in INTERSPEECH 2020
  • Computer Science, Engineering
  • Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this work, we train a sequence-to-sequence neural network conditioned on acoustic speech features to…
    2 Citations

    Speech Synthesis and Control Using Differentiable DSP
    Exemplar-Based Emotive Speech Synthesis

