End-to-end attention-based large vocabulary speech recognition

@article{Bahdanau2016EndtoendAL,
  title={End-to-end attention-based large vocabulary speech recognition},
  author={Dzmitry Bahdanau and Jan Chorowski and Dmitriy Serdyuk and Philemon Brakel and Yoshua Bengio},
  journal={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2016},
  pages={4945--4949}
}
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that…
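The abstract contrasts CTC-based end-to-end models with sequence modelling based on an attention mechanism. As an illustrative aside (this is not the paper's exact architecture; the function names, shapes, and the additive scoring form are assumptions chosen for clarity), one step of content-based attention over encoder states can be sketched in pure Python:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    """Multiply an HxH matrix (list of rows) by a length-H vector."""
    return [dot(row, v) for row in M]

def attention_step(encoder_states, decoder_state, W, V, w):
    """One decoder step of additive (content-based) attention.

    encoder_states: list of T encoder vectors h_1..h_T, each length H
    decoder_state:  current decoder hidden state s, length H
    W, V: HxH projection matrices; w: length-H scoring vector
    Score per frame: e_t = w . tanh(W s + V h_t); weights are
    softmax(e), and the context is the weighted sum of encoder states.
    """
    Ws = matvec(W, decoder_state)
    scores = []
    for h_t in encoder_states:
        Vh = matvec(V, h_t)
        scores.append(dot(w, [math.tanh(a + b) for a, b in zip(Ws, Vh)]))
    alpha = softmax(scores)  # attention weights, one per encoder frame
    H = len(decoder_state)
    context = [sum(alpha[t] * encoder_states[t][i]
                   for t in range(len(encoder_states)))
               for i in range(H)]
    return alpha, context

# Tiny usage example with random states (T=5 frames, H=4 dims).
random.seed(0)
T, H = 5, 4
enc = [[random.gauss(0, 1) for _ in range(H)] for _ in range(T)]
s = [random.gauss(0, 1) for _ in range(H)]
W = [[random.gauss(0, 1) for _ in range(H)] for _ in range(H)]
V = [[random.gauss(0, 1) for _ in range(H)] for _ in range(H)]
w = [random.gauss(0, 1) for _ in range(H)]
alpha, context = attention_step(enc, s, W, V, w)
print(round(sum(alpha), 6))  # → 1.0 (softmax weights sum to one)
```

The weighted context vector then conditions the character predictions at each output step, which is what replaces the CTC module in this family of models.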
An Analysis of Decoding for Attention-Based End-to-End Mandarin Speech Recognition
TLDR
This paper investigates the decoding process for attention-based Mandarin models that use syllables and characters as acoustic modeling units, discusses how to incorporate word information into decoding, and conducts a detailed analysis of the factors that affect decoding performance.
Recent progress in deep end-to-end models for spoken language processing
TLDR
Progress within the IBM Watson Multimodal Group on end-to-end models for spoken language processing is presented, along with a detailed analysis of some salient characteristics of these models compared with state-of-the-art HMM-DNN hybrid systems.
End-to-End Speech Recognition Using Connectionist Temporal Classification
Speech recognition on large vocabulary and noisy corpora is challenging for computers. Recent advances have enabled speech recognition systems to be trained end-to-end, instead of relying on complex…
On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition
TLDR
This paper presents a more effective stochastic gradient descent (SGD) learning-rate schedule that can significantly improve recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, thanks to the combined advantages of multiobjective learning and joint decoding, without requiring linguistic resources.
End-to-end ASR-free keyword search from speech
TLDR
This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
End-to-End Online Speech Recognition with Recurrent Neural Networks
TLDR
An efficient GPU-based RNN training framework for the truncated backpropagation-through-time (BPTT) algorithm, which is suitable for online (continuous) training, and an online version of the connectionist temporal classification (CTC) loss computation, in which the original CTC loss is estimated over a partial sliding window.
Attention-Based End-to-End Speech Recognition on Voice Search
TLDR
This paper uses character embeddings to handle the large vocabulary of Mandarin speech recognition, compares two attention mechanisms, and uses attention smoothing to cover long context in the attention model.
End-to-End Architectures for Speech Recognition
  • Y. Miao, Florian Metze
  • Computer Science
  • New Era for Robust Speech Recognition, Exploiting Deep Learning
  • 2017
TLDR
The EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite-state-transducer decoding setup, achieves state-of-the-art word error rates while drastically simplifying the ASR pipeline.

References

Showing 1–10 of 35 references
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework, which drastically simplifies the existing pipeline for building state-of-the-art ASR systems and achieves comparable word error rates (WERs) while speeding up decoding significantly.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
TLDR
Initial results demonstrate that this new approach achieves phoneme error rates on the TIMIT dataset that are comparable to state-of-the-art HMM-based decoders.
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs
TLDR
This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy, and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00 benchmark, achieving 16.0% error on the full test set.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.