Publications
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize waveforms from those spectrograms.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
Hierarchical Generative Modeling for Controllable Speech Synthesis
TLDR
A high-quality controllable TTS model is proposed which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
TLDR
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
TLDR
Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
Parallel Tacotron: Non-Autoregressive and Controllable TTS
TLDR
A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
TLDR
Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training.
In Teacher We Trust: Learning Compressed Models for Pedestrian Detection
TLDR
This work trains a model that contains 400× fewer parameters than the large network while outperforming AlexNet on the Caltech Pedestrian Dataset, and introduces a higher-dimensional hint layer to increase information flow.
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
TLDR
Parallel Tacotron 2 is introduced, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals.
Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech
TLDR
Compared to the original model, the spectrogram extension gives better mean opinion scores in subjective listening tests, and it is shown that the intonation in the generated spectrograms matches the intonation represented by the generated pitch curves.