Corpus ID: 244116965

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Authors: Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned on the polyphone… 
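The proposed pipeline, a BERT-based front-end feeding a FastSpeech2-based acoustic model, can be sketched at a high level as follows. This is a minimal illustration of the data flow only; the function names and the two-stream interface are assumptions, not the paper's actual code.

```python
def synthesize(text, bert_frontend, acoustic_model, vocoder):
    """Toy sketch of a two-stage TTS pipeline (names are illustrative).

    bert_frontend: fine-tuned BERT front-end handling polyphone
        disambiguation and prosody prediction; maps raw text to a
        phoneme sequence plus prosody labels.
    acoustic_model: pre-trained FastSpeech2-style model mapping the
        linguistic features to a mel-spectrogram.
    vocoder: converts the mel-spectrogram into a waveform.
    """
    phonemes, prosody = bert_frontend(text)
    mel = acoustic_model(phonemes, prosody)
    return vocoder(mel)

# Stub components just to show the shapes flowing through:
frontend = lambda text: (list(text), ["#1"] * len(text))
acoustic = lambda ph, pr: [[0.0] * 80 for _ in ph]   # one 80-dim frame per symbol
vocoder = lambda mel: [0.0] * (len(mel) * 256)        # 256 samples per frame
wav = synthesize("hello", frontend, acoustic, vocoder)
```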



Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
It is shown that incorporating a BERT model into an RNN-based speech synthesis model — where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain — improves prosody; a way of handling arbitrarily long input sequences with BERT is also proposed.
Improving Prosody Modelling with Cross-Utterance Bert Embeddings for End-to-End Speech Synthesis
Cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pretrained BERT model, are used to augment the input of the Tacotron2 decoder.
Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis
It is hypothesized that the text embeddings contain information about the semantics of the phrase and the importance of each word, which should help TTS systems produce more natural prosody and pronunciation.
Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS
This study investigates linguistic features and BERT-derived information to improve the prosody of Mandarin Chinese TTS, and finds that the model with additional character embeddings from BERT performs best, outperforming the baseline by a 0.17 MOS gain.
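A common way to inject such BERT-derived character features is to align them with the phoneme sequence and concatenate them with the encoder inputs. A minimal sketch, assuming one BERT character vector per phoneme position (the character-to-phoneme alignment step is elided, and plain lists stand in for embedding vectors):

```python
def fuse_features(phoneme_emb, bert_char_emb):
    # Concatenate each phoneme embedding with the BERT character
    # embedding aligned to the same position, producing a richer
    # per-position input for the acoustic model's encoder.
    return [p + c for p, c in zip(phoneme_emb, bert_char_emb)]

fused = fuse_features([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
# each fused vector has dim 2 + 1 = 3
```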
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on the Transformer is proposed to generate mel-spectrograms in parallel for TTS; called FastSpeech, it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
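FastSpeech's parallel generation hinges on a length regulator: each phoneme-level hidden state is repeated according to its predicted duration, so every mel frame has an input and the decoder can produce all frames at once instead of autoregressively. A pure-Python sketch (string labels stand in for hidden-state vectors):

```python
def length_regulate(hidden_states, durations):
    # Repeat each phoneme-level hidden state d times, where d is its
    # predicted duration in mel frames; the expanded sequence then has
    # exactly one entry per output frame, enabling parallel decoding.
    expanded = []
    for h, d in zip(hidden_states, durations):
        expanded.extend([h] * d)
    return expanded

frames = length_regulate(["S", "P", "IY", "CH"], [2, 1, 4, 3])
# len(frames) == sum of durations == 10
```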
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly with ground-truth targets instead of the simplified outputs of a teacher model, and by introducing more variation information of speech (e.g., pitch, energy, duration) as conditional inputs.
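In FastSpeech 2, that variation information enters through a variance adaptor: predicted pitch and energy values are quantized into buckets, embedded, and added to the phoneme hidden sequence. A toy scalar version — real models use learned embedding vectors, and the bucket boundaries and table values below are made up for illustration:

```python
import bisect

def quantize(value, boundaries):
    # Map a continuous pitch/energy value to a bucket index.
    return bisect.bisect(boundaries, value)

def variance_adaptor(hidden, pitch, energy,
                     pitch_table, energy_table, boundaries):
    # Add the embedding of each quantized pitch/energy value to the
    # corresponding hidden state (scalars stand in for vectors here).
    out = []
    for h, p, e in zip(hidden, pitch, energy):
        out.append(h + pitch_table[quantize(p, boundaries)]
                     + energy_table[quantize(e, boundaries)])
    return out

adapted = variance_adaptor(
    hidden=[0.0, 0.0], pitch=[0.2, 0.9], energy=[0.9, 0.2],
    pitch_table=[10.0, 20.0], energy_table=[1.0, 2.0],
    boundaries=[0.5])
```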
Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
This work aims to lower TTS systems' reliance on high-quality data by providing them, during training, with textual knowledge extracted by deep pre-trained language models; it investigates the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS model consisting of an encoder and an attention-based decoder.
Unified Mandarin TTS Front-end Based on Distilled BERT Model
A pre-trained language model (PLM) based model is proposed to simultaneously tackle the two most important tasks in TTS front-end, i.e., prosodic structure prediction (PSP) and grapheme-to-phoneme (G2P) conversion.
Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi
The Montreal Forced Aligner (MFA) is an update to the Prosodylab-Aligner that maintains its key feature of trainability on new data while incorporating an improved architecture (triphone acoustic models and speaker adaptation) and other features.
AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines
A large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker text-to-speech systems is presented, along with a robust synthesis baseline capable of zero-shot voice cloning.