End-to-end attention-based large vocabulary speech recognition

@article{Bahdanau2016EndtoendAL,
  title={End-to-end attention-based large vocabulary speech recognition},
  author={Dzmitry Bahdanau and Jan Chorowski and Dmitriy Serdyuk and Philemon Brakel and Yoshua Bengio},
  journal={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2016},
  pages={4945--4949}
}
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that…
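The abstract contrasts CTC-based character models with an attention-based alternative. As a minimal sketch (not the authors' implementation), the core of one attention step is: score every encoder frame against the previous decoder state, normalize the scores into alignment weights, and form a context vector as the weighted sum of encoder states. The dot-product scoring below is a simplification for brevity; the paper's model uses an MLP-based score, and all tensor shapes here are illustrative.

```python
import numpy as np

# Hypothetical sizes: T encoder frames, H hidden units.
rng = np.random.default_rng(0)
T, H = 6, 4
encoder_states = rng.standard_normal((T, H))  # h_1 .. h_T from the encoder
decoder_state = rng.standard_normal(H)        # s_{t-1}, previous decoder state

# Score each encoder frame against the decoder state
# (dot-product scoring as a stand-in for the paper's MLP score).
scores = encoder_states @ decoder_state

# Softmax over time yields alignment weights alpha that sum to 1.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# The context vector is the alpha-weighted sum of encoder states;
# the decoder conditions on it when emitting the next character.
context = alpha @ encoder_states
```

Unlike CTC, which marginalizes over monotonic frame-to-label alignments, these weights are recomputed at every output step, letting the decoder attend to whichever frames are relevant for the next character.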

Citations

An Analysis of Decoding for Attention-Based End-to-End Mandarin Speech Recognition

TLDR
This paper investigates the decoding process for attention-based Mandarin models that use syllables and characters as acoustic modeling units, discusses how to combine word information into the decoding process, and conducts a detailed analysis of the factors that affect decoding performance.

Recent progress in deep end-to-end models for spoken language processing

TLDR
Progress within the IBM Watson Multimodal Group on end-to-end models for spoken language processing is presented, along with a detailed analysis of some salient characteristics of these models compared with state-of-the-art HMM-DNN hybrid systems.

End-to-End Speech Recognition Using Connectionist Temporal Classification

TLDR
Results show that convolutional input layers are advantageous compared to dense ones, and suggest that the number of recurrent layers has a significant impact on the results.

On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition

TLDR
This paper presents a more effective stochastic gradient descent (SGD) learning rate schedule that can significantly improve recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.

End-to-end ASR-free keyword search from speech

TLDR
This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.

End-to-End Online Speech Recognition with Recurrent Neural Networks

TLDR
An efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training, and an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated with a partial sliding window.

Attention-Based End-to-End Speech Recognition on Voice Search

TLDR
This paper uses character embeddings to deal with the large vocabulary of Mandarin speech recognition, compares two attention mechanisms, and uses attention smoothing to cover long context in the attention model.

End-to-End Architectures for Speech Recognition

  • Y. Miao, Florian Metze
  • Computer Science
    New Era for Robust Speech Recognition, Exploiting Deep Learning
  • 2017
TLDR
The EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite state transducer decoding setup, achieves state-of-the-art word error rates, while at the same time drastically simplifying the ASR pipeline.

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
...

References

Showing 1-10 of 35 references

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.

Speech recognition with deep recurrent neural networks

TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.

End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

TLDR
Initial results demonstrate that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.

Attention-Based Models for Speech Recognition

TLDR
The attention-mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.

First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs

TLDR
This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy, and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.

Towards End-To-End Speech Recognition with Recurrent Neural Networks

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…

Sequence to Sequence Learning with Neural Networks

TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.

Deep Speech: Scaling up end-to-end speech recognition

TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

Review of Neural Networks for Speech Recognition

TLDR
Further work is necessary for large-vocabulary continuous-speech problems, to develop training algorithms that progressively build internal word models, and to develop compact VLSI neural net hardware.