Corpus ID: 11590585

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

@article{Amodei2016DeepS2,
  title={Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
  author={Dario Amodei and Sundaram Ananthanarayanan and Rishita Anubhai and Jin Bai and Eric Battenberg and Carl Case and Jared Casper and Bryan Catanzaro and Jingdong Chen and Mike Chrzanowski and Adam Coates and Gregory Frederick Diamos and Erich Elsen and Jesse Engel and Linxi (Jim) Fan and Christopher Fougner and Awni Y. Hannun and Billy Jun and Tony Han and Patrick LeGresley and Xiangang Li and Libby Lin and Sharan Narang and A. Ng and Sherjil Ozair and Ryan J. Prenger and Sheng Qian and Jonathan Raiman and Sanjeev Satheesh and David Seetapun and Shubho Sengupta and Anuroop Sriram and Chong-Jun Wang and Yi Wang and Zhiqian Wang and Bo Xiao and Yan Xie and Dani Yogatama and Junni Zhan and Zhenyao Zhu},
  journal={ArXiv},
  year={2016},
  volume={abs/1512.02595}
}
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech, including noisy environments, accents, and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This…
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin
TLDR
This work explores the applicability of RNA in Mandarin Chinese and presents four effective extensions: in the encoder, the temporal down-sampling is redesigned and a powerful convolutional structure is introduced; in the decoder, a regularizer is utilized to smooth the output distribution and joint training is conducted with a language model.
End-to-End Neural Speech Synthesis
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led
Deep Language: a comprehensive deep learning approach to end-to-end language recognition
TLDR
It is shown that an end-to-end deep learning system can be used to recognize language from speech utterances of various lengths, and that a combination of three deep architectures (feed-forward, convolutional, and recurrent networks) achieves the best performance compared to other network designs.
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
TLDR
Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g., ESPnet).
An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech
TLDR
This paper extends the model to enable dynamic tracking of the language within an utterance, and proposes a training procedure that takes advantage of a newly created mixed-language speech corpus.
Recognizing Long-Form Speech Using Streaming End-to-End Models
TLDR
This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances.
Language Independent End-to-End Architecture For Joint Language and Speech Recognition
End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation
End-to-end recognition of streaming Japanese speech using CTC and local attention
TLDR
This paper explores the possibility of a streaming, online, ASR system for Japanese using a model based on unidirectional LSTMs trained using connectionist temporal classification (CTC) criteria, with local attention.
END-TO-END SPEECH RECOGNITION USING CONNECTIONIST TEMPORAL CLASSIFICATION
Speech recognition on large vocabulary and noisy corpora is challenging for computers. Recent advances have enabled speech recognition systems to be trained end-to-end, instead of relying on complex
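The CTC criterion mentioned in several of the entries above trains the network on frame-level "paths" that are collapsed into output strings. A minimal sketch of that collapsing rule (the symbols and the `-` blank token here are illustrative, not from any of the cited papers):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path into an output string:
    merge consecutive repeated symbols, then drop blank tokens."""
    out = []
    prev = None
    for sym in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; the blank separates true repeats.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)
```

For example, the path `"hh-e-ll-ll-oo"` collapses to `"hello"`: repeats within a blank-free run merge, while the blank between the two `l` runs preserves the double letter.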

References

Showing 1-10 of 85 references
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
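The attention mechanism summarized above scores each encoder frame against the current decoder state and blends the frames into a context vector. A minimal dot-product sketch in plain Python (the vectors and the unscaled dot-product scoring are illustrative assumptions, not the paper's exact formulation):

```python
import math

def attend(query, keys, values):
    """One attention step: score each encoder frame (key) against the
    decoder state (query), softmax the scores, and return the attention
    weights plus the weighted sum of values (the context vector)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context vector: weights applied component-wise to the values.
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context
```

With `query = [1.0, 0.0]` and keys `[[1.0, 0.0], [0.0, 1.0]]`, the first frame receives the larger weight, so the context vector leans toward its value.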
A big data approach to acoustic model training corpus selection
TLDR
This paper proposes a new approach to constructing large high quality unsupervised sets to train DNN models for large vocabulary speech recognition and shows that this approach yields models with approximately 18K context dependent states that achieve 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian languages.
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs
TLDR
This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
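The core idea behind first-pass decoding with a language model is to keep only the top-k partial transcriptions as the decoder walks the acoustic frames. A simplified beam search over per-frame character log-probabilities (this is not the paper's modified CTC prefix-search; there is no blank handling or path merging, and the `lm` hook and `alpha` weight are illustrative assumptions):

```python
import math

def beam_search(log_probs, alphabet, beam_width=3, lm=None, alpha=0.5):
    """log_probs: one list of log P(char) per frame, aligned with
    `alphabet`. Optional lm(prefix, char) returns a language-model
    log-score, weighted by alpha. Keeps the top-k prefixes per frame."""
    beams = [("", 0.0)]  # (prefix, accumulated log-score)
    for frame in log_probs:
        candidates = []
        for prefix, score in beams:
            for ch, lp in zip(alphabet, frame):
                s = score + lp
                if lm is not None:
                    s += alpha * lm(prefix, ch)
                candidates.append((prefix + ch, s))
        # Prune to the highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]
```

With a sharp distribution per frame this reduces to greedy decoding; the beam matters when the acoustic scores are ambiguous and the language-model term can reorder hypotheses.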
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Lexicon-Free Conversational Speech Recognition with Neural Networks
TLDR
An approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks.
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.