• Publications
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder.
Pointer Networks
This paper proposes Pointer Networks (Ptr-Nets), a new neural architecture that learns the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence, using a recently proposed mechanism of neural attention. Ptr-Nets improve over sequence-to-sequence with input attention and also generalize to variable-size output dictionaries.
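The pointing mechanism described above can be sketched as follows: attention scores over input positions are normalized into a distribution, and the output "points" at an input index rather than drawing from a fixed vocabulary. This is a minimal sketch assuming simple dot-product scoring (the paper uses an additive attention form); the function and variable names are illustrative.

```python
import numpy as np

def pointer_attention(decoder_state, encoder_states):
    # Score each encoder position against the current decoder state.
    # (Simplified dot-product scoring; Ptr-Nets use an additive form.)
    scores = encoder_states @ decoder_state
    # Softmax over input positions: the output distribution is over
    # indices of the input sequence, so its size varies with input length.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

Because the distribution is over input positions, the same model handles inputs (and hence output dictionaries) of any length, which is what lets Ptr-Nets generalize to variable-size outputs.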
Adversarial Autoencoders
This paper shows how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction, and data visualization, and reports experiments on the MNIST, Street View House Numbers, and Toronto Face datasets.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
This work proposes a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead.
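The curriculum above can be sketched as a per-step coin flip between the ground-truth previous token and the model's own prediction, with the probability of using the ground truth decaying over training. This is a minimal sketch under an assumed linear decay (the paper explores several decay schedules); the names are illustrative.

```python
import random

def choose_input(true_prev, generated_prev, step, total_steps):
    # Probability of feeding the ground-truth token decays linearly
    # from 1.0 (fully guided) to 0.0 (model feeds on its own outputs).
    p_true = max(0.0, 1.0 - step / total_steps)
    return true_prev if random.random() < p_true else generated_prev
```

Early in training the decoder sees mostly ground-truth history; by the end it mostly conditions on its own predictions, matching the conditions it faces at inference time.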
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognition systems.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
Hybrid speech recognition with Deep Bidirectional LSTM
The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, although the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.