• Publications
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Tacotron: Towards End-to-End Speech Synthesis
Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, is presented. It achieves a mean opinion score of 3.82 on a subjective 5-point scale for US English, outperforming a production parametric system in terms of naturalness.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Fixing a Broken ELBO
This framework derives variational lower and upper bounds on the mutual information between the input and the latent variable, and uses these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, results in synthesized audio that matches the prosody of the reference signal with fine time detail.
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. This is achieved by training two separate neural networks.
Trainable frontend for robust and far-field keyword spotting
This work introduces a novel frontend called per-channel energy normalization (PCEN), which uses dynamic compression based on automatic gain control to replace the widely used static compression in speech recognition.
Deep Probabilistic Programming
Edward, a Turing-complete probabilistic programming language, is proposed, which makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC.