Controllable Neural Prosody Synthesis

Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a…


Context-Aware Prosody Correction for Text-Based Speech Editing
This work proposes a new context-aware method for more natural-sounding text-based editing of speech, which uses a series of neural networks to generate salient prosody features that depend on the prosody of the speech surrounding the edit and are amenable to fine-grained user control.
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
A model is proposed that generates speech explicitly conditioned on the three primary acoustic correlates of prosody (F0, energy, and duration), providing more interpretable, temporally precise, and disentangled control when the acoustic features are automatically predicted from text.
A Survey on Neural Speech Synthesis
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, focusing on the key components of neural TTS, including text analysis, acoustic models, and vocoders.
Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation
This paper investigates whether one can control prosody directly from the input text, in order to encode information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor.
Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet
This paper proposes Controllable LPCNet (CLPCNet), an improved LPCNet vocoder capable of pitch-shifting and time-stretching of speech, and shows that CLPCNet performs pitch-shifting of speech on unseen datasets with high accuracy relative to prior neural methods.
MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling
This work introduces MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control, and opens the door to assistive tools to empower individuals across a diverse range of musical experience.
Review of end-to-end speech synthesis technology based on deep learning
The open-source speech corpora of English, Chinese, and other languages that can be used for speech synthesis tasks are summarized, and some commonly used subjective and objective speech quality evaluation methods are introduced.
The Sillwood Technologies System for the VoiceMOS Challenge 2022
This system is based on pre-trained self-supervised waveform prediction models, improves its generalisation ability through stochastic weight averaging, and uses influence functions to identify possible low-quality data within the training set to further increase the model's performance for the OOD track.
Chunked Autoregressive GAN for Conditional Waveform Synthesis
This paper's proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.


Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis
  • Younggun Lee, Taesu Kim
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Using generative modelling to produce varied intonation for speech synthesis
This work uses variational autoencoders (VAEs), which explicitly place the most "average" data close to the mean of the Gaussian prior, and proposes that by moving towards the tails of the prior distribution, the model will transition towards generating more idiosyncratic, varied renditions.
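The tail-sampling idea in the entry above can be sketched with plain Gaussian latents. This is only an illustration of the sampling knob, not the paper's model: the latent dimensionality, the temperature values, and the absence of a decoder are all assumptions made for the demo.

```python
import numpy as np

# Sketch of sampling a VAE's Gaussian prior: latents near the mean give
# "average" renditions, while a temperature > 1 pushes samples toward
# the tails of the prior, where more idiosyncratic, varied renditions
# are expected. Dimensions and temperatures are arbitrary demo choices.

rng = np.random.default_rng(0)
latent_dim = 16

def sample_latent(temperature: float, n: int) -> np.ndarray:
    """Draw n latents from N(0, temperature^2 * I)."""
    return temperature * rng.standard_normal((n, latent_dim))

# Average distance from the prior mean at each temperature: tail
# samples lie measurably farther from the mean, which is the quantity
# the entry above associates with more varied output.
mean_norm_center = np.linalg.norm(sample_latent(0.1, 1000), axis=1).mean()
mean_norm_tail = np.linalg.norm(sample_latent(1.5, 1000), axis=1).mean()
```

Feeding the higher-temperature latents through a trained decoder would then yield the more varied renditions the entry describes.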
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Diphone synthesis using an overlap-add technique for speech waveforms concatenation
  • F. Charpentier, M. Stella
  • Computer Science
    ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1986
A new method is presented for text-to-speech synthesis using diphones, based on a representation of the speech signal by its short-time Fourier transform at a pitch-synchronous sampling rate.
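The pitch-synchronous overlap-add step at the heart of such diphone concatenation can be sketched as follows. The frame length, hop size, and toy sine input here are assumptions chosen for illustration, not details of the original 1986 system:

```python
import numpy as np

# Illustrative overlap-add: short windowed segments, extracted
# pitch-synchronously (one frame per pitch period, two periods long),
# are summed back at a fixed synthesis hop. With hop equal to the
# analysis spacing the signal is roughly reconstructed; changing the
# hop relative to that spacing is what shifts pitch in PSOLA-style
# methods.

def overlap_add(frames: np.ndarray, hop: int) -> np.ndarray:
    """Sum Hann-windowed frames at a fixed synthesis hop."""
    n_frames, frame_len = frames.shape
    window = np.hanning(frame_len)
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += window * frame
    return out

# Toy input: a 100 Hz sine at 8 kHz, cut into one frame per period.
sr, f0 = 8000, 100
period = sr // f0                       # 80 samples per pitch period
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * f0 * t)
frames = np.stack([signal[i * period : i * period + 2 * period]
                   for i in range(40)])

resynth = overlap_add(frames, hop=period)   # hop == analysis spacing
```

With 50% overlap the Hann windows sum to an approximately constant gain, so the resynthesis stays close to the input; a smaller or larger hop would compress or stretch the pitch periods instead.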
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
A novel generative model is proposed that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models, and is able to reliably discover and control important but rarely labelled attributes of speech.
Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer-level representations on coarser-level ones.
Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation
A quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) is proposed to improve the pitch controllability of the WaveNet (WN) vocoder.
Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis
This work introduces the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron, and shows that the system can render text with more pitch and energy variation than two state-of-the-art baseline models.
WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications
A vocoder-based speech synthesis system named WORLD was developed in an effort to improve the sound quality of real-time applications using speech; experiments showed that it was superior to the other systems in terms of both sound quality and processing speed.