PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu
This paper introduces PnG BERT, a new encoder model for neural TTS. The model augments the original BERT model by taking both phoneme and grapheme representations of text as input, along with the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation…
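The abstract describes the key idea: phoneme and grapheme tokens of the same sentence are fed to BERT as one sequence, with a word-level index tying each token back to its word. Below is a minimal, illustrative sketch of how such an input could be assembled; the token names, helper function, and ID layout are assumptions for illustration, not the paper's actual API.

```python
# Illustrative sketch of a PnG-BERT-style input: phonemes first, graphemes
# second, with shared word-position IDs encoding the word-level alignment.
# All names here are hypothetical; the paper's implementation may differ.

def build_png_input(words, phonemes_per_word):
    """Concatenate phoneme and grapheme tokens with word-alignment IDs."""
    tokens, segment_ids, word_ids = ["[CLS]"], [0], [0]
    # First segment: the phonemes of each word, tagged with that word's index.
    for w_idx, phons in enumerate(phonemes_per_word, start=1):
        for p in phons:
            tokens.append(p)
            segment_ids.append(0)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_ids.append(0)
    # Second segment: grapheme (word) tokens, tagged with the SAME indices,
    # so the model can relate each word's spelling to its pronunciation.
    for w_idx, word in enumerate(words, start=1):
        tokens.append(word)
        segment_ids.append(1)
        word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_ids.append(0)
    return tokens, segment_ids, word_ids

tokens, segs, wids = build_png_input(
    words=["hello", "world"],
    phonemes_per_word=[["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]],
)
```

In this layout, masked-language-model pre-training on plain text can mask phoneme or grapheme spans while the shared word IDs let attention link the two representations of the same word.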


A Survey on Neural Speech Synthesis
A comprehensive survey of neural TTS, aiming to provide a good understanding of current research and future trends, focusing on the key components of neural TTS: text analysis, acoustic models, and vocoders.
Translatotron 2: Robust direct speech-to-speech translation
Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses.
NWT: Towards natural audio-to-video generation with representation learning
A novel discrete variational autoencoder with adversarial loss, dVAE-Adv, is proposed, which learns a new discrete latent representation the authors call Memcodes; Memcodes are straightforward to implement, require no additional loss terms, are stable to train compared with other approaches, and show evidence of interpretability.


SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece is a language-independent subword tokenizer and detokenizer designed for neural text processing; experiments show it is possible to achieve accuracy comparable to direct subword training from raw sentences.
Theoretical Limitations of Self-Attention in Neural Sequence Models
Across both soft and hard attention, strong theoretical limitations on the computational abilities of self-attention are shown: it cannot model periodic finite-state languages or hierarchical structure unless the number of layers or heads increases with input length.
Universal Transformers
The Universal Transformer (UT) is proposed: a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field.
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which significantly improves robustness as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with the ground-truth target instead of the simplified output from a teacher, and by introducing more variation information of speech as conditional inputs.
Improving Prosody Modelling with Cross-Utterance Bert Embeddings for End-to-End Speech Synthesis
Cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pretrained BERT model, are used to augment the input of the Tacotron2 decoder.
Parallel Tacotron: Non-Autoregressive and Controllable TTS
  • Isaac Elias, Heiga Zen, +4 authors, Yonghui Wu
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
A sequence-to-sequence neural network which directly generates speech waveforms from text inputs, extending the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop, enabling parallel training and synthesis.
Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
It is shown that incorporating a BERT model in an RNN-based speech synthesis model — where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain — improves prosody; a way of handling arbitrarily long sequences with BERT is also proposed.
Unified Mandarin TTS Front-end Based on Distilled BERT Model
A pre-trained language model (PLM) based model is proposed to simultaneously tackle the two most important tasks in TTS front-end, i.e., prosodic structure prediction (PSP) and grapheme-to-phoneme (G2P) conversion.