• Publications
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps …
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
On Training Targets for Supervised Speech Separation
Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking-based targets, in general, are significantly better than spectral envelope based targets.
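The two winning targets above are both per-bin ratios between clean-speech and interference energy. A minimal NumPy sketch, assuming magnitude spectrograms as inputs; the array values, the compression exponent `beta`, and the small `eps` stabilizer are illustrative, not taken from the paper:

```python
import numpy as np

def irm(speech_mag, noise_mag, beta=0.5, eps=1e-8):
    """Ideal ratio mask: fraction of speech energy per time-frequency bin,
    compressed by exponent beta (0.5 is a common choice)."""
    return (speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps))**beta

def fft_mask(speech_mag, mixture_mag, eps=1e-8):
    """FFT-MASK: ratio of clean-speech to mixture spectral magnitudes."""
    return speech_mag / (mixture_mag + eps)

# Toy magnitudes (time x frequency); real inputs would come from an STFT.
S = np.array([[1.0, 0.5], [0.2, 2.0]])   # clean speech magnitudes
N = np.array([[0.5, 0.5], [0.2, 0.1]])   # noise magnitudes
mix = S + N  # additive-magnitude approximation of the mixture

m = irm(S, N)         # always in [0, 1]
f = fft_mask(S, mix)  # can exceed 1 when phases cancel in a real mixture
```

Note the design difference: the IRM is bounded in [0, 1] by construction, while the FFT-MASK is unbounded because mixture magnitudes can be smaller than the clean magnitudes after phase interaction.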
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable and robust speech synthesis.
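The core mechanism behind GSTs is attention over a small learnable token bank: a reference-encoder vector acts as the query, and the resulting weighted sum of tokens is the style embedding. A simplified NumPy sketch, assuming single-head dot-product attention (the paper uses multi-head attention); the dimensions and the random stand-in vectors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, D = 10, 8                       # number of style tokens, embedding dim
tokens = rng.normal(size=(K, D))   # stand-in for the learnable token bank

def style_embedding(ref_enc, tokens):
    """Attend over the token bank with the reference-encoder output as query."""
    scores = tokens @ ref_enc      # dot-product similarity to each token
    weights = softmax(scores)      # combination weights sum to 1
    return weights @ tokens        # weighted sum of tokens = style embedding

ref = rng.normal(size=D)           # stand-in for a reference encoder output
emb = style_embedding(ref, tokens)
```

At inference time the same tokens can be selected or weighted directly, without any reference audio, which is what makes the learned factors controllable.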
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, results in synthesized audio that matches the prosody of the reference signal with fine time detail.
Complex Ratio Masking for Monaural Speech Separation
The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
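Unlike the magnitude-only targets above, the complex ideal ratio mask (cIRM) operates on real and imaginary STFT components jointly, so applying it recovers both magnitude and phase. A minimal NumPy sketch, assuming complex STFT arrays for the mixture `Y` and clean speech `S`; the toy values and `eps` are illustrative:

```python
import numpy as np

def cirm(Y, S, eps=1e-8):
    """Complex ideal ratio mask M such that S = M * Y (complex multiplication).
    Equivalent to S / Y, written out in real/imaginary components."""
    denom = Y.real**2 + Y.imag**2 + eps
    Mr = (Y.real * S.real + Y.imag * S.imag) / denom  # real part
    Mi = (Y.real * S.imag - Y.imag * S.real) / denom  # imaginary part
    return Mr + 1j * Mi

# Toy complex STFT bins for mixture and clean speech.
Y = np.array([1.0 + 2.0j, 0.5 - 1.0j])
S = np.array([0.5 + 1.0j, 0.2 + 0.3j])

M = cirm(Y, S)
recovered = M * Y  # applying the mask to the mixture reconstructs S
```

In practice a network regresses a compressed version of `M` rather than the raw values, since the real and imaginary parts are unbounded.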
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
Hierarchical Generative Modeling for Controllable Speech Synthesis
A high-quality controllable TTS model is proposed that can control latent attributes of the generated speech which are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
Towards Scaling Up Classification-Based Speech Separation
This work proposes to learn more linearly separable and discriminative features from raw acoustic features and to train linear SVMs, which are much easier and faster to train than kernel SVMs.