Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM…
  • 5,924
  • 236
Adversarial Autoencoders
In this paper, we propose the “adversarial autoencoder” (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational…
  • 831
  • 153
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
  • 619
  • 151
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits…
  • 1,939
  • 141
Pointer Networks
We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems…
  • 1,011
  • 125
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires…
  • 525
  • 119
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
  • 1,388
  • 109
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to…
  • 933
  • 105
Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional…
  • 948
  • 103
Hybrid Speech Recognition with Deep Bidirectional LSTM
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on…
  • 985
  • 100