Corpus ID: 16979536

Deep Speech: Scaling up end-to-end speech recognition

@article{Hannun2014DeepSS,
  title={Deep Speech: Scaling up end-to-end speech recognition},
  author={Awni Y. Hannun and Carl Case and Jared Casper and Bryan Catanzaro and Gregory Frederick Diamos and Erich Elsen and Ryan J. Prenger and Sanjeev Satheesh and Shubho Sengupta and Adam Coates and A. Ng},
  journal={ArXiv},
  year={2014},
  volume={abs/1412.5567}
}
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. [...] Key Method We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard…
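The abstract's central claim is that the model predicts characters directly, with no phoneme dictionary, and the per-frame outputs are collapsed into text. A minimal sketch of the standard CTC collapse rule may make this concrete; the function name and blank marker below are illustrative, not taken from the paper's code.

```python
BLANK = "_"  # illustrative CTC blank symbol ("emit no character" at this frame)

def ctc_collapse(frame_labels):
    """Standard CTC decoding step: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in frame_labels:
        if ch != prev and ch != BLANK:  # keep a char only when it starts a new run
            out.append(ch)
        prev = ch
    return "".join(out)

# Per-frame argmax predictions collapse straight to characters:
ctc_collapse(list("cc_aaa_bb"))   # -> "cab"
ctc_collapse(list("b_oo_o_kk"))  # -> "book" (blank separates the double "o")
```

Note how the blank symbol is what allows genuinely repeated letters ("oo" in "book") to survive the repeat-merging step.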
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Recognizing Long-Form Speech Using Streaming End-to-End Models
TLDR
This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances.
End-to-End Architectures for Speech Recognition
  • Y. Miao, Florian Metze
  • Computer Science
  • New Era for Robust Speech Recognition, Exploiting Deep Learning
  • 2017
TLDR
The EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite state transducer decoding setup, achieves state-of-the-art word error rates while at the same time drastically simplifying the ASR pipeline.
Controlling the Noise Robustness of End-to-End Automatic Speech Recognition Systems
In this work, we propose a novel training scheme to modularize end-to-end systems. Our training scheme aims at altering the flow of information in an end-to-end system to use the kernels of this…
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
TLDR
This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks, Switchboard and CallHome; it presents rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrasts the performance of word and phone CTC models.
End-to-End Speech Recognition Using Connectionist Temporal Classification
Speech recognition on large vocabulary and noisy corpora is challenging for computers. Recent advances have enabled speech recognition systems to be trained end-to-end, instead of relying on complex…
SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
TLDR
SpecSwap is presented, a simple data augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances and can be applied to Transformer-based networks for the end-to-end speech recognition task.
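Since the summary says the augmentation acts directly on the input spectrogram, a rough sketch of the idea is swapping two non-overlapping blocks along the time axis. This is a guess at the operation, not the paper's implementation; block size, sampling strategy, and the analogous frequency-axis swap are details the paper specifies that are only approximated here.

```python
import random

def spec_swap_time(spec, block_len, rng=random):
    """SpecSwap-style augmentation sketch: swap two non-overlapping time
    blocks of a spectrogram, represented here as a list of frames.
    """
    T = len(spec)
    if T < 2 * block_len:  # too short to hold two disjoint blocks
        return spec
    # Pick two non-overlapping block start positions i < j.
    i = rng.randrange(0, T - 2 * block_len + 1)
    j = rng.randrange(i + block_len, T - block_len + 1)
    out = list(spec)  # leave the input untouched
    out[i:i + block_len], out[j:j + block_len] = (
        out[j:j + block_len], out[i:i + block_len])
    return out
```

The swap permutes frames without discarding any energy, which is the intuition behind treating it as a cheap label-preserving perturbation.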
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
Applications of deep learning to speech enhancement
TLDR
This work proposes a model to perform speech dereverberation by estimating its spectral magnitude from the reverberant counterpart, proposes a method to prune neurons away from the model without impacting performance, and compares this method to other methods in the literature.
Noise-robust Attention Learning for End-to-End Speech Recognition
TLDR
Noise-robust attention learning (NRAL) is proposed, which explicitly tells the attention mechanism where to "listen at" in a sequence of noisy speech features and effectively improves the noise robustness of the end-to-end ASR model.

References

SHOWING 1-10 OF 49 REFERENCES
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition
  • D. Ellis, N. Morgan
  • Computer Science
  • 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258)
  • 1999
TLDR
There appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances, and doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs
TLDR
This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy, and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
Sequence-discriminative training of deep neural networks
TLDR
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative on a standard 300-hour American conversational telephone speech task.
Deep Neural Networks for Acoustic Modeling in Speech Recognition
TLDR
This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
TLDR
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.