Fast and accurate recurrent neural network acoustic models for speech recognition
@inproceedings{Sak2015FastAA,
  title={Fast and accurate recurrent neural network acoustic models for speech recognition},
  author={Hasim Sak and Andrew W. Senior and Kanishka Rao and Françoise Beaufays},
  booktitle={INTERSPEECH},
  year={2015}
}
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed-forward deep neural networks (DNNs) as acoustic models for speech recognition. We show that frame stacking and a reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
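The frame stacking and reduced frame rate idea from the abstract can be sketched as below. This is a minimal illustration, not code from the paper: `stack_frames` is a hypothetical helper, and the stack size of 8 with an output step of 3 is one plausible configuration.

```python
import numpy as np

def stack_frames(features, stack=8, skip=3):
    """Stack consecutive acoustic frames and subsample the sequence.

    features: (num_frames, feat_dim) array of per-frame features.
    stack:    number of consecutive frames concatenated per super-frame
              (illustrative value; not necessarily the paper's setting).
    skip:     emit one super-frame every `skip` input frames, reducing
              the frame rate (e.g. 10 ms -> 30 ms).
    """
    num_frames, feat_dim = features.shape
    stacked = []
    for t in range(0, num_frames - stack + 1, skip):
        # Concatenate `stack` consecutive frames into one wider vector.
        stacked.append(features[t:t + stack].reshape(-1))
    return np.array(stacked)

# 100 frames of 40-dim log-mel features -> stacked super-frames
feats = np.random.randn(100, 40)
out = stack_frames(feats)
print(out.shape)  # (31, 320): one third the frame rate, 8x the feature dim
```

Since the network then runs on roughly a third as many (wider) frames, decoding touches the recurrent layers far less often, which is where the speedup comes from.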
377 Citations
Deep LSTM for Large Vocabulary Continuous Speech Recognition
- Computer Science, ArXiv
- 2017
This work introduces a training framework with layer-wise training and exponential moving average methods for deeper LSTM models, and a novel transfer-learning strategy with segmental Minimum Bayes-Risk training, showing that training on only a small part of the dataset can outperform training on the full dataset from the beginning.
Dynamic Frame Skipping for Fast Speech Recognition in Recurrent Neural Network Based Acoustic Models
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
A novel recurrent neural network architecture called Skip-RNN is proposed that dynamically skips less important speech frames, accelerating acoustic model computation by up to 2.4 times without any noticeable degradation in transcription accuracy.
Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model
- Computer Science, ArXiv
- 2017
A novel frame-retaining method, applied in decoding, that reduces the time consumption of both training and decoding of long short-term memory recurrent neural networks.
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
- Computer Science, INTERSPEECH
- 2017
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model, without the traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode.
Lower Frame Rate Neural Network Acoustic Models
- Computer Science, INTERSPEECH
- 2016
On a large vocabulary Voice Search task, it is shown that with conventional models, one can slow the frame rate to 40ms while improving WER by 3% relative over a CTC-based model, thus improving overall system speed.
The Microsoft 2016 conversational speech recognition system
- Computer Science, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
Microsoft's conversational speech recognition system is described, in which recent developments in neural-network-based acoustic and language modeling are combined to advance the state of the art on the Switchboard recognition task.
Long short-term memory recurrent neural network architectures for Urdu acoustic modeling
- Computer Science, Int. J. Speech Technol.
- 2019
LSTM architectures were compared with gated recurrent unit (GRU) based architectures and it was found that LSTM has an advantage over GRU.
Simplifying long short-term memory acoustic models for fast training and decoding
- Computer Science, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
To accelerate decoding of LSTMs, it is proposed to apply frame skipping during training, and frame skipping and posterior copying (FSPC) during decoding to resolve two challenges faced by LSTM models: high model complexity and poor decoding efficiency.
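The frame skipping and posterior copying (FSPC) decoding idea summarized above can be sketched as follows. This is a hedged illustration under stated assumptions: `decode_with_fspc` and its `acoustic_model` callable are hypothetical names, and the assumption is simply that the network is evaluated only on every (skip+1)-th frame, with its posterior reused for the skipped frames.

```python
import numpy as np

def decode_with_fspc(frames, acoustic_model, skip=2):
    """Frame skipping and posterior copying (FSPC) during decoding.

    frames:         iterable of per-frame feature vectors.
    acoustic_model: hypothetical callable mapping one frame to a
                    posterior vector over acoustic states.
    skip:           number of frames that reuse the last posterior
                    between two real network evaluations.
    """
    posteriors = []
    last = None
    for t, frame in enumerate(frames):
        if t % (skip + 1) == 0:
            last = acoustic_model(frame)  # evaluate the network
        posteriors.append(last)           # copied for skipped frames
    return np.stack(posteriors)
```

With `skip=2`, the LSTM forward pass runs on only a third of the frames, which is where the decoding speedup comes from; the copied posteriors keep the output sequence at the original frame rate for the decoder.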
On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition
- Computer Science, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This paper presents a more effective stochastic gradient descent (SGD) learning rate schedule that can significantly improve the recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.
On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition
- Computer Science, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This work presents a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices, and finds that the proposed technique reduces the size of a Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.
References
SHOWING 1-10 OF 29 REFERENCES
Learning acoustic frame labeling for speech recognition with recurrent neural networks
- Computer Science, Physics, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with cross-entropy (CE) using HMM state alignments, and the effect of sequence-discriminative training on these models is shown.
Context dependent phone models for LSTM RNN acoustic modelling
- Computer Science, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
This work argues that using multi-state HMMs with LSTM RNN acoustic models is an unnecessary vestige of GMM-HMM and DNN-HMM modelling, and shows that minimum-duration modelling can lead to improved results.
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
- Computer Science, INTERSPEECH
- 2014
The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced, and it is shown that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
Sequence discriminative distributed training of long short-term memory recurrent neural networks
- Computer Science, INTERSPEECH
- 2014
This paper compares two sequence discriminative criteria – maximum mutual information and state-level minimum Bayes risk, and investigates a number of variations of the basic training strategy to better understand issues raised by both the sequential model, and the objective function.
Speech recognition with deep recurrent neural networks
- Computer Science, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Hybrid speech recognition with Deep Bidirectional LSTM
- Computer Science, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding
- 2013
The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates; the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.
Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling
- Computer Science, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2009
This paper demonstrates that neural-network acoustic models can be trained with sequence classification criteria using exactly the same lattice-based methods that have been developed for Gaussian mixture HMMs, and that using a sequence classification criterion in training leads to considerably better performance.
GMM-Free DNN Training
- Computer Science
- 2014
It is shown that CD trees can be built with DNN alignments that are better matched to the DNN model and its features, and that these trees and alignments result in better models than those built from the GMM alignments and trees.
Error back propagation for sequence training of Context-Dependent Deep Networks for conversational speech transcription
- Computer Science, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
This work investigates back-propagation-based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription, and finds that obtaining reasonable results requires heuristics that point to a problem with lattice sparseness.
Bidirectional recurrent neural networks
- Computer Science, IEEE Trans. Signal Process.
- 1997
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.