Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

@inproceedings{Povey2016PurelySN,
  title={Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI},
  author={Daniel Povey and Vijayaditya Peddinti and Daniel Galvez and Pegah Ghahremani and Vimal Manohar and X. Na and Yiming Wang and S. Khudanpur},
  booktitle={INTERSPEECH},
  year={2016}
}
In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. [...] To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity, we compute the objective function using neural network outputs at one third the standard frame rate.
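For orientation, the criterion being optimized is the standard MMI objective, with the paper's modification that the denominator sum runs over a phone n-gram graph rather than word sequences (this is a textbook statement of the objective, not quoted from the paper):

```latex
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{u=1}^{U} \log
    \frac{p(\mathbf{O}_u \mid \mathbb{M}_{W_u})\, P(W_u)}
         {\sum_{W'} p(\mathbf{O}_u \mid \mathbb{M}_{W'})\, P(W')}
```

where $\mathbf{O}_u$ is the observation sequence of utterance $u$, $\mathbb{M}_W$ the HMM corresponding to transcription $W$, and $P(W)$ the language-model probability. In the lattice-free formulation the denominator sum is computed exactly on GPU over a phone-level n-gram graph rather than approximated with word lattices.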
Comparison of Lattice-Free and Lattice-Based Sequence Discriminative Training Criteria for LVCSR
A memory-efficient implementation of the forward-backward computation that allows the use of unigram word-level language models in the denominator calculation while still doing a full summation on GPU; silence modeling is found to seriously impact performance in the lattice-free case and to need special treatment.
A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-trained Neural Network Acoustic Models
  • Chao Weng, Dong Yu
  • Computer Science, Mathematics
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using the LF-bMMI or LF-sMBR criteria, without the need for cross-entropy pre-training.
End-to-end Speech Recognition Using Lattice-free MMI
Work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models, showing that the approach can achieve results comparable to regular LF-MMI on well-known large-vocabulary tasks.
Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI
Various extensions to standard LF-MMI training are described that allow lattices obtained by decoding unsupervised data to be used as supervision; different methods for splitting the lattices and incorporating frame tolerances into the supervision FST are investigated.
Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models
This paper performs discriminative adaptation using lattices obtained from a first pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework.
Sequence Distillation for Purely Sequence Trained Acoustic Models
This paper proposes using the sequence-level temperatured Kullback-Leibler divergence as a metric for teacher-student (TS) training, and shows that frame-level TS training sometimes even degrades the performance of the student model, whereas the proposed method consistently improves accuracy.
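The frame-level ingredient of such teacher-student training, a temperature-softened KL divergence between teacher and student output distributions, can be sketched as follows (a generic illustration of the mechanism, not the paper's sequence-level criterion; all names are ours):

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Temperature-scaled log-softmax over the last axis."""
    z = logits / T
    z = z - np.max(z, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def ts_kl(teacher_logits, student_logits, T=2.0):
    """Mean per-frame KL(teacher || student), both softened by temperature T."""
    log_p = log_softmax(teacher_logits, T)  # teacher distribution
    log_q = log_softmax(student_logits, T)  # student distribution
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
```

A higher temperature flattens both distributions, exposing more of the teacher's relative preferences among non-top classes; the cited paper's contribution is applying such a tempered divergence at the sequence level rather than per frame.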
Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition
A joint word-character A2W model that learns to first spell the word and then recognize it, providing a rich output to the user instead of simple word hypotheses, which makes it especially useful for words unseen or rarely seen during training.
Active Learning for LF-MMI Trained Neural Networks in ASR
Experimental results suggest that the active learning (AL) scheme can benefit much more from fresh data than semi-supervised training (SST) in reducing the word error rate (WER); AL yields a 6-13% relative WER reduction against the baseline trained on a 4000-hour transcribed dataset.
Discriminative training of RNNLMs with the average word error criterion
By fine-tuning the RNNLM on lattices with the average edit distance loss, it is shown that a 1.9% relative improvement in word error rate over a purely generatively trained model is obtained.
Domain adaptation of lattice-free MMI based TDNN models for speech recognition
This study generalizes KLD-regularized model adaptation to train domain-specific TDNN acoustic models, and demonstrates that the proposed domain-adapted models achieve around 7-29% relative word error rate reduction on these tasks, even when only around 1K adaptation utterances are available.

References

Showing 1-10 of 25 references
Error back propagation for sequence training of Context-Dependent Deep Networks for conversational speech transcription
This work investigates back-propagation-based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription, and finds that to get reasonable results heuristics are needed, which points to a problem with lattice sparseness.
Sequence-discriminative training of deep neural networks
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative on a standard 300-hour American conversational telephone speech task.
Learning acoustic frame labeling for speech recognition with recurrent neural networks
  • H. Sak, A. Senior, +4 authors J. Schalkwyk
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments; the effect of sequence-discriminative training on these models is also shown.
Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition
Novel LSTM-based RNN architectures are presented which make more effective use of model parameters to train acoustic models for large-vocabulary speech recognition.
Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling
  • Brian Kingsbury
  • Computer Science
  • 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2009
This paper demonstrates that neural-network acoustic models can be trained with sequence classification criteria using exactly the same lattice-based methods that have been developed for Gaussian mixture HMMs, and that using a sequence classification criterion in training leads to considerably better performance.
Deep bi-directional recurrent networks over spectral windows
This paper applies a windowed (truncated) LSTM to conversational speech transcription, and finds that a limited context is adequate; it is not necessary to scan the entire utterance.
Deep Speech: Scaling up end-to-end speech recognition
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Speaker adaptation of neural network acoustic models using i-vectors
This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features for ASR; the adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Fast and accurate recurrent neural network acoustic models for speech recognition
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed-forward deep neural networks (DNNs) as acoustic models for speech recognition.
Advances in speech transcription at IBM under the DARPA EARS program
This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program and presents results on English conversational telephony test data from the 2003 and 2004 NIST evaluations.