ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Yi Ren, Ming Lei, Zhiying Huang, Shi-Rui Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Expressive text-to-speech (TTS) has recently become a hot research topic, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) depend on each other and together produce natural prosody; and 3) due to high variability of prosody and the limited amount of high…

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis
This work analyzes the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposes a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
PortaSpeech is proposed, a portable and high-quality generative text-to-speech model that outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M.
FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech, a novel feed-forward network based on Transformer that generates mel-spectrograms in parallel for TTS, is proposed; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis
This work focuses on phone-level prosody modelling, where a Gaussian mixture model (GMM) based mixture density network is introduced, and demonstrates that a GMM can model phone-level prosody better than a single Gaussian.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.
Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis
This work introduces the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron, and shows that the system can render text with more pitch and energy variation than two state-of-the-art baseline models.
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Glow-TTS is proposed, a flow-based generative model for parallel TTS that does not require any external aligner and obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
This work proposes DiffSinger, an acoustic model for singing voice synthesis (SVS) based on the diffusion probabilistic model: a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score. DiffSinger outperforms state-of-the-art SVS work.