Corpus ID: 16979536

Deep Speech: Scaling up end-to-end speech recognition

@article{Hannun2014DeepSS,
  title={Deep Speech: Scaling up end-to-end speech recognition},
  author={Awni Y. Hannun and Carl Case and Jared Casper and Bryan Catanzaro and Gregory Frederick Diamos and Erich Elsen and Ryan J. Prenger and Sanjeev Satheesh and Shubho Sengupta and Adam Coates and A. Ng},
  journal={ArXiv},
  year={2014},
  volume={abs/1412.5567}
}
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard…
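The data synthesis the abstract alludes to is typically additive-noise overlay: superimposing background noise on clean utterances at a chosen signal-to-noise ratio. A minimal sketch (the function name, variables, and the 16 kHz example are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def synthesize_noisy(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on a clean waveform at a target SNR (dB)."""
    # Tile or trim the noise clip to match the utterance length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so the mixture hits the requested signal-to-noise ratio.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)   # 1 s of synthetic "audio" at 16 kHz
babble = rng.standard_normal(4000)       # short noise clip, tiled to fit
noisy = synthesize_noisy(utterance, babble, snr_db=10.0)
```

Running many noise clips against each clean utterance is how a modest corpus is expanded into the "large amount of varied data" the abstract describes.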

Citations of this paper

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

Recognizing Long-Form Speech Using Streaming End-to-End Models

TLDR
This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances.
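The "LSTM state manipulation" idea above carries the recurrent state across consecutive short utterances instead of resetting it, so training sees something like long-form audio. A toy sketch with a plain tanh RNN (all sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
H, F = 8, 4                                  # hidden size, feature size
Wx = rng.standard_normal((H, F)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

def run_chunk(frames: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Run a simple tanh RNN over one chunk, starting from the carried-in state h."""
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

chunks = [rng.standard_normal((20, F)) for _ in range(3)]  # three short "utterances"

# Long-form simulation: carry the final state of each chunk into the next,
# rather than resetting to zeros at every utterance boundary.
h = np.zeros(H)
for c in chunks:
    h = run_chunk(c, h)
```

With real LSTMs the same loop carries both the hidden and cell states; the point is only that utterance boundaries stop being state resets.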

End-to-End Architectures for Speech Recognition

  • Y. Miao, Florian Metze
  • Computer Science
    New Era for Robust Speech Recognition, Exploiting Deep Learning
  • 2017
TLDR
The EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite state transducer decoding setup, achieves state-of-the-art word error rates, while at the same time drastically simplifying the ASR pipeline.

Controlling the Noise Robustness of End-to-End Automatic Speech Recognition Systems

TLDR
This work applies a novel training scheme to extract the noise reduction capabilities from a noise-robust automatic speech recognition (ASR) system and implements a speech enhancer from it, which can be integrated into an ASR system as front-end, is trainable, and reduces background noise.

End-to-End Speech Recognition Using Connectionist Temporal Classification

TLDR
Results show that the use of convolutional input layers is advantageous when compared to dense ones, and suggest that the number of recurrent layers has a significant impact on the results.

SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition

TLDR
SpecSwap is presented, a simple data augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances and can be applied to Transformer-based networks for end-to-end speech recognition.
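Acting "directly on the spectrogram" here means exchanging blocks of the time-frequency matrix. A hedged sketch of one such swap along each axis (block widths, function name, and the in-bounds guard are my assumptions, not the paper's exact recipe):

```python
import numpy as np

def spec_swap(spec: np.ndarray, rng: np.random.Generator,
              time_width: int = 10, freq_width: int = 8) -> np.ndarray:
    """Swap one pair of time blocks and one pair of frequency blocks in a copy."""
    out = spec.copy()
    T, F = out.shape
    # Pick two block start positions along time and exchange them if disjoint.
    t0, t1 = sorted(rng.choice(T - time_width, size=2, replace=False))
    if t1 - t0 >= time_width:
        out[t0:t0 + time_width], out[t1:t1 + time_width] = \
            out[t1:t1 + time_width].copy(), out[t0:t0 + time_width].copy()
    # Same along the frequency axis.
    f0, f1 = sorted(rng.choice(F - freq_width, size=2, replace=False))
    if f1 - f0 >= freq_width:
        out[:, f0:f0 + freq_width], out[:, f1:f1 + freq_width] = \
            out[:, f1:f1 + freq_width].copy(), out[:, f0:f0 + freq_width].copy()
    return out

rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 80))   # fake (time, freq) log-mel spectrogram
augmented = spec_swap(spec, rng)
```

Unlike masking-style augmentation, a swap preserves the full set of spectrogram values and only scrambles their positions.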

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.

Noise-robust Attention Learning for End-to-End Speech Recognition

TLDR
Noise-robust attention learning (NRAL) is proposed which explicitly tells the attention mechanism where to "listen at" in a sequence of noisy speech features, which effectively improves the noise robustness of the end-to-end ASR model.

End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition

TLDR
This paper proposes introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources.

...

References

SHOWING 1-10 OF 49 REFERENCES

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition

  • D. Ellis, N. Morgan
  • Computer Science
    1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258)
  • 1999
TLDR
There appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances, and doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
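The reported ~25:1 patterns-to-parameters ratio translates into a quick back-of-the-envelope parameter budget. A sketch with hypothetical corpus numbers (the 300-hour figure and 10 ms frame shift are illustrative assumptions):

```python
# Illustrative arithmetic for the reported ~25:1 patterns-to-parameters ratio.
hours = 300                        # hypothetical training-set size
frames_per_second = 100            # 10 ms frame shift -> 100 frames per second
patterns = hours * 3600 * frames_per_second
param_budget = patterns // 25
print(f"{patterns:,} frames -> ~{param_budget:,} parameters")
```

Under these assumptions the budget comes out to a few million parameters, which is why the study observes diminishing returns once systems grow past the data's support.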

First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs

TLDR
This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
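The prefix-search decoder above builds on the basic CTC collapse rule: merge repeated frame-level labels, then drop blanks. A minimal best-path sketch of that rule (simpler than the paper's prefix search; the label alphabet is an assumption):

```python
def ctc_collapse(path, blank=0):
    """Best-path CTC decode: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Frame-level argmax path -> label sequence; note the blank between the
# two 2s is what lets the same label appear twice in the output.
print(ctc_collapse([0, 1, 1, 0, 2, 2, 2, 0, 2]))  # [1, 2, 2]
```

Prefix search generalizes this by summing probability over all frame paths that collapse to the same prefix, which is where a language model can be folded in.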

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.

Towards End-To-End Speech Recognition with Recurrent Neural Networks

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…

Improvements to Deep Convolutional Neural Networks for LVCSR

TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.

Sequence to Sequence Learning with Neural Networks

TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
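The reversal trick described above is a one-line preprocessing step: reverse every source sequence while leaving targets untouched, so the first source token sits next to the first target token. A sketch (the toy sentence pair is an illustrative assumption):

```python
def reverse_sources(pairs):
    """Reverse each source sentence; targets unchanged (the seq2seq reversal trick)."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

data = [(["i", "am", "here"], ["je", "suis", "ici"])]
print(reverse_sources(data))  # [(['here', 'am', 'i'], ['je', 'suis', 'ici'])]
```

The shortened distance between early source and early target words is what the TLDR credits for the easier optimization problem.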

Sequence-discriminative training of deep neural networks

TLDR
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.

Deep Neural Networks for Acoustic Modeling in Speech Recognition

TLDR
This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

TLDR
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.