Corpus ID: 453615

End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

@article{Chorowski2014EndtoendCS,
  title={End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results},
  author={Jan Chorowski and Dzmitry Bahdanau and Kyunghyun Cho and Yoshua Bengio},
  journal={ArXiv},
  year={2014},
  volume={abs/1412.1602}
}
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created from a subset of input symbols selected by the attention mechanism. We report initial results… 
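The attention step described in the abstract — score each encoder state against the current decoder state, normalize the scores into weights, and form the context as a weighted sum of encoder states — can be sketched in plain Python. This is a minimal dot-product variant for illustration only; the paper itself uses a learned hybrid (content plus location) scoring network, so the scoring function and the function names below are assumptions, not the authors' implementation:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return the attention context vector for one decoding step.

    decoder_state:   list of floats (current decoder hidden state)
    encoder_states:  list of encoder hidden states, one per input frame
    """
    # score each encoder state (here: simple dot product with the decoder state)
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    # normalize scores into attention weights that sum to 1
    weights = softmax(scores)
    # context = attention-weighted sum of the encoder states
    dim = len(encoder_states[0])
    return [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
            for i in range(dim)]
```

With a decoder state strongly aligned to one encoder frame, nearly all of the weight mass lands on that frame, so the context approximates that frame's encoder state — this soft selection is the "subset of input symbols" the abstract refers to.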


A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition
TLDR
This paper studies the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words.
On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition
TLDR
This paper presents a more effective stochastic gradient descent (SGD) learning rate schedule that can significantly improve the recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.
End-to-end attention-based distant speech recognition with Highway LSTM
TLDR
This paper proposes an end-to-end attention-based speech recognizer with multichannel input that performs sequence prediction directly at the character level and incorporates Highway long short-term memory (HLSTM) which outperforms previous models on AMI distant speech recognition task.
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
TLDR
This work analyzes the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss and evaluates representations from different layers of the deep model.
Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR
TLDR
It is shown that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vector for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
End-to-End Phoneme Recognition using Models from Semantic Image Segmentation
TLDR
The encoder-decoder architecture of U-Net is extended and it is shown it is capable of good performance in the acoustic modelling of a speech recognition system and the importance of the concatenation step is investigated.
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture
TLDR
A novel architecture and decoding approach are presented for improving recurrent neural network-transducer (RNN-T) performance and integrating encoder-decoder-based sequence-to-sequence models (S2S), which makes streaming ASR practical.
End-to-End Speech Recognition Models
TLDR
This thesis proposes a novel approach to ASR with neural attention models and demonstrates the end-to-end speech recognition model, which can directly emit English/Chinese characters or even word pieces given the audio signal.
End-to-End Speech Recognition with Local Monotonic Attention
TLDR
Experimental results demonstrate that encoder-decoder based ASR with local monotonic attention could achieve significant performance improvements and reduce the computational complexity in comparison with the one that used the standard global attention architecture.

References

SHOWING 1-10 OF 27 REFERENCES
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Deep Belief Networks for phone recognition
TLDR
Deep Belief Networks (DBNs) have recently proved to be very effective in a variety of machine learning problems, and this paper applies DBNs to acoustic modeling.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
TLDR
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
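CTC, cited above, labels unsegmented sequences by letting the network emit a frame-level path over the label set plus a special blank symbol, then collapsing repeated labels and removing blanks to obtain the final label sequence. A minimal sketch of that collapsing rule (the blank symbol and function name here are illustrative, not from the paper):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path into its label sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for symbol in path:
        # keep a symbol only if it differs from the previous frame
        # and is not the blank
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)
```

For example, the path `--hh-e-ll-lo--` collapses to `hello`; note that the blank between the two `l` runs is what allows a doubled letter to survive the repeat-merging step.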
Global optimization of a neural network-hidden Markov model hybrid
TLDR
An original method for integrating artificial neural networks (ANN) with hidden Markov models (HMM) is presented, and results are reported for speaker-independent recognition experiments using this integrated ANN-HMM system on the TIMIT continuous speech database.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
TLDR
Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.
Bidirectional recurrent neural networks
TLDR
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Sequence-discriminative training of deep neural networks
TLDR
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Neural networks for speech and sequence recognition
TLDR
This paper presents post-processors based on dynamic programming, ANN/DP hybrids, and ANN/HMM hybrids, along with experiments on phoneme recognition with RBFs and on online handwriting recognition.