• Corpus ID: 11298362

Fast and accurate recurrent neural network acoustic models for speech recognition

@inproceedings{Sak2015FastAA,
  title={Fast and accurate recurrent neural network acoustic models for speech recognition},
  author={Hasim Sak and Andrew W. Senior and Kanishka Rao and Françoise Beaufays},
  booktitle={INTERSPEECH},
  year={2015}
}
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feedforward deep neural networks (DNNs) as acoustic models for speech recognition. […] We show that frame stacking and a reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
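
The frame stacking and reduced frame rate idea lends itself to a short sketch. The window of 8 consecutive 10 ms frames and the 3x subsampling below are illustrative assumptions in the spirit of the paper's setup, not its exact configuration.

```python
import numpy as np

def stack_and_subsample(features, stack=8, skip=3):
    """Concatenate `stack` consecutive feature frames into one super-frame and
    keep only every `skip`-th super-frame, reducing the frame rate the LSTM
    (and the decoder) must process, e.g. 10 ms -> 30 ms."""
    num_frames, feat_dim = features.shape
    stacked = [features[t:t + stack].reshape(-1)
               for t in range(0, num_frames - stack + 1, skip)]
    return np.asarray(stacked)

# Example: ~1 second of 40-dimensional log-mel features at a 10 ms frame shift.
feats = np.random.randn(100, 40).astype(np.float32)
super_frames = stack_and_subsample(feats)
print(super_frames.shape)  # (31, 320): roughly 3x fewer frames, 8x wider inputs
```
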

Citations

Deep LSTM for Large Vocabulary Continuous Speech Recognition
TLDR
This work introduces a training framework with layer-wise training and exponential moving average methods for deeper LSTM models, and a novel transfer learning strategy with segmental Minimum Bayes-Risk, which makes it possible for training on only a small part of the dataset to outperform training on the full dataset from the beginning.
Dynamic Frame Skipping for Fast Speech Recognition in Recurrent Neural Network Based Acoustic Models
TLDR
A novel recurrent neural network architecture called Skip-RNN is proposed, which dynamically skips less important speech frames and can accelerate acoustic model computation by up to 2.4 times without any noticeable degradation in transcription accuracy.
Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model
TLDR
A novel frame retaining method, applied in decoding, which reduces the time consumption of both training and decoding for long short-term memory recurrent neural networks.
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
TLDR
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need for decoding.
Lower Frame Rate Neural Network Acoustic Models
TLDR
On a large vocabulary Voice Search task, it is shown that with conventional models, one can slow the frame rate to 40ms while improving WER by 3% relative over a CTC-based model, thus improving overall system speed.
The Microsoft 2016 conversational speech recognition system
  • W. Xiong, J. Droppo, G. Zweig
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
Microsoft's conversational speech recognition system is described, in which recent developments in neural-network-based acoustic and language modeling are combined to advance the state of the art on the Switchboard recognition task.
Long short-term memory recurrent neural network architectures for Urdu acoustic modeling
TLDR
LSTM architectures were compared with gated recurrent unit (GRU) based architectures and it was found that LSTM has an advantage over GRU.
Simplifying long short-term memory acoustic models for fast training and decoding
TLDR
To resolve two challenges faced by LSTM acoustic models, high model complexity and poor decoding efficiency, it is proposed to apply frame skipping during training, and frame skipping with posterior copying (FSPC) during decoding.
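
The entry above only names frame skipping and posterior copying (FSPC); the sketch below shows one plausible reading of the decoding-time part, in which the acoustic model is evaluated on every second frame and its posteriors are copied to the skipped frames so the decoder still receives a score per frame. The skip factor and the toy model are assumptions for illustration.

```python
import numpy as np

def posteriors_with_fspc(frames, acoustic_model, skip=2):
    """Run the (expensive) acoustic model only on every `skip`-th frame and
    copy its posteriors to the skipped frames, so decoding still sees one
    posterior vector per input frame."""
    outputs, last = [], None
    for t, frame in enumerate(frames):
        if t % skip == 0:
            last = acoustic_model(frame)   # evaluated frames
        outputs.append(last)               # skipped frames reuse the last posteriors
    return np.stack(outputs)

# Toy stand-in for an LSTM acoustic model: softmax over 5 states.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 40))
def toy_model(x):
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(posteriors_with_fspc(rng.standard_normal((10, 40)), toy_model).shape)  # (10, 5)
```
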
On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition
TLDR
This paper presents a more effective stochastic gradient descent (SGD) learning rate schedule that can significantly improve the recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.
On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition
TLDR
This work presents a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices, and finds that it can reduce the size of a Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.
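
As a rough illustration of weight-matrix compression of the kind described above, the sketch below applies a truncated-SVD low-rank factorization to a single matrix; this is a common ingredient of such schemes and an assumption here, not a reproduction of the paper's joint recurrent/non-recurrent method.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by A (m x rank) @ B (rank x n) via truncated SVD,
    so W @ x ~= A @ (B @ x) with fewer parameters when rank * (m + n) < m * n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

W = np.random.randn(1024, 512).astype(np.float32)
A, B = low_rank_factorize(W, rank=64)
x = np.random.randn(512).astype(np.float32)
rel_err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(A.shape, B.shape, f"relative error {rel_err:.3f}")
```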

References

Showing 1-10 of 29 references
Learning acoustic frame labeling for speech recognition with recurrent neural networks
  • H. Sak, A. Senior, J. Schalkwyk
  • Computer Science, Physics
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments; the effect of sequence discriminative training on these models is also shown.
Context dependent phone models for LSTM RNN acoustic modelling
  • A. Senior, H. Sak, I. Shafran
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
This work argues that using multi-state HMMs with LSTM RNN acoustic models is an unnecessary vestige of GMM-HMM and DNN-HMM modelling, and shows that minimum-duration modelling can lead to improved results.
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
TLDR
The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced, and it is shown that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
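
The "linear recurrent projection layer" mentioned above (often called LSTMP) can be summarized in one recurrence step: the state fed back into the gates is a low-dimensional linear projection of the cell outputs, which shrinks the recurrent weight matrices. The dimensions and random weights below are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, params):
    """One step of an LSTM with a recurrent projection layer (LSTMP)."""
    W, R, b, P = params                 # input weights, recurrent weights, bias, projection
    z = W @ x + R @ r_prev + b          # stacked pre-activations for the four gates
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # cell state update
    m = sigmoid(o) * np.tanh(c)                         # cell outputs
    return P @ m, c                                     # low-dimensional state fed back

cell, proj, n_in = 256, 64, 80
rng = np.random.default_rng(0)
params = (rng.standard_normal((4 * cell, n_in)) * 0.01,
          rng.standard_normal((4 * cell, proj)) * 0.01,
          np.zeros(4 * cell),
          rng.standard_normal((proj, cell)) * 0.01)
r, c = np.zeros(proj), np.zeros(cell)
for x in rng.standard_normal((5, n_in)):   # a few time steps
    r, c = lstmp_step(x, r, c, params)
print(r.shape, c.shape)                    # (64,) (256,)
```
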
Sequence discriminative distributed training of long short-term memory recurrent neural networks
TLDR
This paper compares two sequence discriminative criteria – maximum mutual information and state-level minimum Bayes risk, and investigates a number of variations of the basic training strategy to better understand issues raised by both the sequential model, and the objective function.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Hybrid speech recognition with Deep Bidirectional LSTM
TLDR
The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.
Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling
  • Brian Kingsbury
  • Computer Science
    2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2009
TLDR
This paper demonstrates that neural-network acoustic models can be trained with sequence classification criteria using exactly the same lattice-based methods that have been developed for Gaussian mixture HMMs, and that using a sequence classification criterion in training leads to considerably better performance.
GMM-Free DNN Training
TLDR
It is shown that CD trees can be built with DNN alignments, which are better matched to the DNN model and its features, and that these trees and alignments result in better models than those obtained from GMM alignments and trees.
Error back propagation for sequence training of Context-Dependent Deep NetworkS for conversational speech transcription
TLDR
This work investigates back-propagation based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription and finds that heuristics are needed to get reasonable results, which points to a problem with lattice sparseness.
Bidirectional recurrent neural networks
TLDR
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
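
The bidirectional structure referred to above amounts to one recurrent pass over the input in each time direction, with the two hidden sequences joined per frame before the output layer. The simple tanh recurrence and sizes below are illustrative assumptions, not the original network.

```python
import numpy as np

def rnn_pass(xs, W, R):
    """Plain tanh RNN over a sequence; returns one hidden vector per frame."""
    h, out = np.zeros(R.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + R @ h)
        out.append(h)
    return np.stack(out)

def bidirectional_features(xs, fwd, bwd):
    """Concatenate forward-in-time and backward-in-time hidden states so each
    frame's representation depends on both past and future context."""
    h_f = rnn_pass(xs, *fwd)
    h_b = rnn_pass(xs[::-1], *bwd)[::-1]   # run in reverse, then re-align in time
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
xs = rng.standard_normal((20, 40))               # 20 frames, 40-dim features
fwd = (0.1 * rng.standard_normal((32, 40)), 0.1 * rng.standard_normal((32, 32)))
bwd = (0.1 * rng.standard_normal((32, 40)), 0.1 * rng.standard_normal((32, 32)))
print(bidirectional_features(xs, fwd, bwd).shape)  # (20, 64)
```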