• Corpus ID: 1166498

Towards End-To-End Speech Recognition with Recurrent Neural Networks

@inproceedings{Graves2014TowardsES,
  title={Towards End-To-End Speech Recognition with Recurrent Neural Networks},
  author={Alex Graves and Navdeep Jaitly},
  booktitle={ICML},
  year={2014}
}
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. Key result: combining the network with a baseline system further reduces the error rate to 6.7%.
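
The core technique here is training a recurrent network directly on character outputs with the Connectionist Temporal Classification (CTC) loss, which needs no frame-level alignment between audio and text. A minimal sketch of that training step, assuming PyTorch; the network sizes, feature dimension, and vocabulary below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the paper's configuration.
NUM_CHARS = 29          # 26 letters + space + apostrophe + CTC blank
FEAT_DIM = 40           # e.g. filterbank features per frame
HIDDEN = 128

class CharRNN(nn.Module):
    """Bidirectional LSTM mapping acoustic frames to per-frame
    character log-probabilities, suitable for CTC training."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * HIDDEN, NUM_CHARS)

    def forward(self, x):                  # x: (batch, time, FEAT_DIM)
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(dim=-1)

model = CharRNN()
ctc = nn.CTCLoss(blank=0)                  # index 0 reserved for the blank

x = torch.randn(8, 200, FEAT_DIM)          # 8 utterances, 200 frames each
targets = torch.randint(1, NUM_CHARS, (8, 30))
input_lengths = torch.full((8,), 200, dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)

log_probs = model(x).transpose(0, 1)       # CTCLoss expects (time, batch, chars)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The loss itself sums over all ways of aligning the 30-character target to the 200 input frames, which is what removes the need for an intermediate phonetic representation.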

Citations

End-to-End Speech Recognition Using Connectionist Temporal Classification
TLDR
Results show that the use of convolutional input layers is advantageous compared to dense ones, and suggest that the number of recurrent layers has a significant impact on the results.
End-to-End Deep Neural Network for Automatic Speech Recognition
TLDR
An end-to-end deep learning system that utilizes mel-filter bank features to directly output spoken phonemes, without the need for a traditional Hidden Markov Model for decoding, is implemented.
Lexicon-Free Conversational Speech Recognition with Neural Networks
TLDR
An approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks.
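
To make the decoding procedure concrete, below is a heavily simplified character-level beam search that interpolates per-frame acoustic scores with a language-model score. The `lm_score` callable and the `alpha` weight are placeholders (a real character-level LM would supply them), and CTC blank/repeat handling (prefix beam search) is omitted for brevity:

```python
def beam_search(frame_log_probs, vocab, lm_score, beam_width=8, alpha=0.5):
    """Toy beam search over per-frame character log-probabilities.

    frame_log_probs: list of dicts mapping char -> log P(char | frame)
    lm_score: placeholder callable, log P(next_char | prefix) from a
              character-level language model (an assumption, not a real API)
    alpha: language-model interpolation weight
    """
    beams = {"": 0.0}                      # prefix -> total log score
    for frame in frame_log_probs:
        candidates = {}
        for prefix, score in beams.items():
            for ch in vocab:
                new_score = (score + frame[ch]
                             + alpha * lm_score(prefix, ch))
                key = prefix + ch
                # Keep the best score seen for each candidate prefix.
                if key not in candidates or new_score > candidates[key]:
                    candidates[key] = new_score
        # Prune to the top beam_width prefixes.
        beams = dict(sorted(candidates.items(),
                            key=lambda kv: kv[1], reverse=True)[:beam_width])
    return max(beams.items(), key=lambda kv: kv[1])
```

A real CTC decoder would additionally track blank and non-blank probabilities per prefix, but the scoring-and-pruning structure is the same.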
A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition
TLDR
This paper studies the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words.
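
The encoder-decoder split is straightforward to express in code: one RNN consumes the acoustic vectors and a second RNN emits output symbols conditioned on the encoder's summary. A minimal sketch, assuming PyTorch and illustrative sizes (the paper's actual models add attention and are far larger):

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, feat_dim=40, vocab=30, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, prev_tokens):
        # Encode the acoustic sequence; keep only the final state
        # as a fixed-size summary of the utterance.
        _, state = self.encoder(frames)
        # Decode conditioned on that summary (teacher forcing).
        dec, _ = self.decoder(self.embed(prev_tokens), state)
        return self.out(dec)               # (batch, out_len, vocab) logits

model = EncoderDecoder()
logits = model(torch.randn(2, 150, 40), torch.randint(0, 30, (2, 20)))
print(logits.shape)                        # torch.Size([2, 20, 30])
```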
End-to-End Online Speech Recognition with Recurrent Neural Networks
TLDR
An efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training, and an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated with a partial sliding window.
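
Truncated BPTT is what makes the online setting workable: unroll the network over a fixed window, update, then carry the hidden state forward without backpropagating across the boundary. A minimal sketch in PyTorch with illustrative sizes and toy per-frame targets (the paper's actual system uses CTC rather than the framewise cross-entropy shown here):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=40, hidden_size=128, batch_first=True)
head = nn.Linear(128, 30)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

stream = torch.randn(1, 10_000, 40)        # one long audio stream
labels = torch.randint(0, 30, (1, 10_000)) # toy per-frame targets
chunk = 100                                # truncation window
state = None

for t in range(0, stream.size(1), chunk):
    x = stream[:, t:t + chunk]
    y = labels[:, t:t + chunk]
    out, state = rnn(x, state)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, 30),
                                       y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Detach so gradients never flow past the truncation boundary.
    state = tuple(s.detach() for s in state)
```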
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer
TLDR
This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors.
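
Wordpieces keep the output vocabulary small while capturing longer context than single characters. A toy greedy longest-match segmenter, assuming a given wordpiece inventory; the `##` continuation marker is a common convention used here for illustration, not necessarily this paper's scheme:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation into sub-word units.
    Continuation pieces carry a '##' prefix (an assumed convention)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                     # no piece matched: unknown word
            return ["<unk>"]
        start = end
    return pieces

vocab = {"speech", "rec", "##og", "##nition", "##s"}
print(wordpiece_tokenize("recognition", vocab))  # ['rec', '##og', '##nition']
```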
Automatic Speech Recognition using different Neural Network Architectures – A Survey
TLDR
A comparative study of the advantages of the surveyed architectures with respect to word error rate, phone error rate, etc. in the area of Automatic Speech Recognition (ASR) concludes the survey.
End to End Speech Recognition System
TLDR
An end-to-end speech recognition system that directly transcribes audio data to text/phonemes is explained; the system replaces the conventional speech recognition pipeline with a single recurrent neural network (RNN) architecture that combines a deep bidirectional LSTM network with the Connectionist Temporal Classification objective function.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers.
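
The "attend" part of LAS computes, at each output step, a context vector as an attention-weighted sum of encoder features. A minimal sketch of one such step using simple dot-product attention (the paper itself uses a learned MLP scorer; sizes here are illustrative):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    """One dot-product attention step.

    decoder_state:   (batch, dim)        current speller state
    encoder_outputs: (batch, time, dim)  listener features
    Returns a (batch, dim) context vector summarizing the audio.
    """
    # Alignment scores between the decoder state and every frame.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2))  # (B, T, 1)
    weights = F.softmax(scores.squeeze(2), dim=1)                    # (B, T)
    # Weighted sum of encoder features.
    return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)

context = attend(torch.randn(4, 256), torch.randn(4, 120, 256))
print(context.shape)  # torch.Size([4, 256])
```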
Towards an end-to-end speech recognizer for Portuguese using deep neural networks
TLDR
This first effort shows that an all-neural, high-performance speech recognition system for PT-BR is feasible, achieving a label error rate about 17% higher than commercial systems with a language model.

References

Showing 1-10 of 23 references
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.
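
In modern framework terms, the "deep" and "bidirectional" ingredients are just stacking and direction options on the recurrent layer. A brief illustrative sketch in PyTorch:

```python
import torch
import torch.nn as nn

# Three stacked layers, each reading the input in both directions;
# the output at each step concatenates forward and backward states.
deep_birnn = nn.LSTM(input_size=40, hidden_size=128, num_layers=3,
                     bidirectional=True, batch_first=True)

frames = torch.randn(2, 300, 40)        # (batch, time, features)
out, _ = deep_birnn(frames)
print(out.shape)                        # torch.Size([2, 300, 256])
```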
From speech to letters - using a novel neural network architecture for grapheme based ASR
TLDR
This work investigates a novel ASR approach using Bidirectional Long Short-Term Memory Recurrent Neural Networks and Connectionist Temporal Classification, which is capable of transcribing graphemes directly and yields results highly competitive with phoneme transcription.
Deep Neural Networks for Acoustic Modeling in Speech Recognition
TLDR
This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition
TLDR
This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously; the system outperforms the best Gaussian Mixture Model/Hidden Markov Model baseline.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
TLDR
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby removing the need both for pre-segmented training data and for post-processing of the network outputs.
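
CTC augments the label set with a blank symbol and sums over all frame-level alignments; the simplest decoder then takes the per-frame argmax and applies the CTC collapse rule. A minimal sketch of that rule (best-path decoding):

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """Best-path CTC decoding: merge consecutive repeats, drop blanks.
    frame_argmax: list of per-frame argmax label indices."""
    out, prev = [], None
    for label in frame_argmax:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# With 0 as blank, the path 1,1,0,1,2,0 collapses to 1,1,2:
print(ctc_greedy_decode([1, 1, 0, 1, 2, 0]))  # [1, 1, 2]
```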
Bidirectional recurrent neural networks
TLDR
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Connectionist Speech Recognition: A Hybrid Approach
From the Publisher: Connectionist Speech Recognition: A Hybrid Approach describes the theory and implementation of a method to incorporate neural network approaches into state-of-the-art continuous speech recognition systems.
Open vocabulary speech recognition with flat hybrid models
TLDR
It is demonstrated that, by using a simple flat hybrid model, a well-optimized state-of-the-art speech recognition system can be significantly improved over a wide range of out-of-vocabulary rates.
Supervised Sequence Labelling with Recurrent Neural Networks
  • A. Graves
  • Computer Science
    Studies in Computational Intelligence
  • 2008
TLDR
A new type of output layer that allows recurrent networks to be trained directly for sequence labelling tasks where the alignment between the inputs and the labels is unknown, and an extension of the long short-term memory network architecture to multidimensional data, such as images and video sequences.