Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

  Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur
In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate.
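For reference, the MMI criterion named in the title can be written in standard notation (background material, not reproduced from this page); in the lattice-free variant the denominator sum over word sequences is replaced by a summation over a phone n-gram language model graph:

```latex
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{u} \log
    \frac{p\!\left(\mathbf{X}_u \mid \mathbb{M}_{w_u}\right) \, P(w_u)}
         {\sum_{w} p\!\left(\mathbf{X}_u \mid \mathbb{M}_{w}\right) \, P(w)}
```

Here $\mathbf{X}_u$ is the acoustic observation sequence of utterance $u$, $w_u$ its reference transcript, and $\mathbb{M}_w$ the HMM corresponding to word sequence $w$; the numerator scores the reference while the denominator sums over all competing hypotheses.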


Comparison of Lattice-Free and Lattice-Based Sequence Discriminative Training Criteria for LVCSR

A memory-efficient implementation of the forward-backward computation that allows uni-gram word-level language models in the denominator calculation while still performing a full summation on GPU; the authors also find that silence modeling seriously impacts performance in the lattice-free case and needs special treatment.
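The denominator forward-backward computation referred to above can be sketched in log-space as follows. This is a minimal, illustrative CPU version over a dense HMM, not the memory-efficient GPU implementation the paper describes; the function and variable names are my own:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_backward(log_pi, log_A, log_obs):
    """Log-domain forward-backward over a fully connected HMM.

    log_pi[j]    : log initial probability of state j
    log_A[i][j]  : log transition probability i -> j
    log_obs[t][j]: log emission score of state j at frame t
    Returns (total log-likelihood, per-frame state posteriors).
    """
    T, S = len(log_obs), len(log_pi)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    # Forward pass: alpha[t][j] = log-prob of all paths ending in state j at t.
    for j in range(S):
        alpha[0][j] = log_pi[j] + log_obs[0][j]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = log_obs[t][j] + logsumexp(
                [alpha[t - 1][i] + log_A[i][j] for i in range(S)])
    # Backward pass: beta[t][i] = log-prob of the suffix starting from state i.
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = logsumexp(
                [log_A[i][j] + log_obs[t + 1][j] + beta[t + 1][j]
                 for j in range(S)])
    loglik = logsumexp(alpha[T - 1])
    # State occupation posteriors gamma[t][j]; these are the quantities
    # differentiated in discriminative training.
    gamma = [[math.exp(alpha[t][j] + beta[t][j] - loglik) for j in range(S)]
             for t in range(T)]
    return loglik, gamma
```

Working in the log domain avoids underflow over long utterances; the per-frame posteriors sum to one by construction, which is a convenient sanity check.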

End-to-end Speech Recognition Using Lattice-free MMI

This work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models shows that the approach can achieve results comparable to regular LF-MMI on well-known large-vocabulary tasks.

A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-trained Neural Network Acoustic Models

  • Chao Weng, Dong Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using the LF-bMMI or LF-sMBR criteria, without the need for cross-entropy pre-training.

Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI

Various extensions to standard LF-MMI training are described to allow the use as supervision of lattices obtained via decoding of unsupervised data and different methods for splitting the lattices and incorporating frame tolerances into the supervision FST are investigated.

Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

This paper performs discriminative adaptation using lattices obtained from a first pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework.

Sequence Distillation for Purely Sequence Trained Acoustic Models

This paper proposes using the sequence-level temperatured Kullback-Leibler divergence as a metric for TS training and shows that frame-level TS training sometimes even degrades the performance of the student model, whereas the proposed method consistently improves accuracy.

Active Learning for LF-MMI Trained Neural Networks in ASR

Experimental results suggest that the AL scheme benefits much more from fresh data than SST in reducing the word error rate (WER), with AL yielding a 6–13% relative WER reduction against the baseline trained on a 4000-hour transcribed dataset.

Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition

A joint word-character A2W model that learns to first spell the word and then recognize it, providing a rich output to the user instead of simple word hypotheses, which makes it especially useful for words unseen or rarely seen during training.

Discriminative training of RNNLMs with the average word error criterion

By fine-tuning the RNNLM on lattices with the average edit-distance loss, a 1.9% relative improvement in word error rate is obtained over a purely generatively trained model.

Domain adaptation of lattice-free MMI based TDNN models for speech recognition

This study generalizes KLD-regularized model adaptation to train domain-specific TDNN acoustic models and demonstrates that the proposed domain-adapted models can achieve around 7–29% relative word error rate reduction on these tasks, even when only around 1K adaptation utterances are available.

Error back propagation for sequence training of Context-Dependent Deep Networks for conversational speech transcription

This work investigates back-propagation based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription and finds that heuristics are needed to get reasonable results, pointing to a problem with lattice sparseness.

Sequence-discriminative training of deep neural networks

Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.

Learning acoustic frame labeling for speech recognition with recurrent neural networks

  • H. Sak, A. Senior, J. Schalkwyk
  • Computer Science, Physics
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments, and the effect of sequence-discriminative training on these models is examined.

Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

Novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition are presented.

Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling

  • Brian Kingsbury
  • Computer Science
    2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2009
This paper demonstrates that neural-network acoustic models can be trained with sequence classification criteria using exactly the same lattice-based methods that have been developed for Gaussian mixture HMMs, and that using a sequence classification criterion in training leads to considerably better performance.

Deep bi-directional recurrent networks over spectral windows

This paper applies a windowed (truncated) LSTM to conversational speech transcription and finds that a limited context is adequate and that it is not necessary to scan the entire utterance.

Deep Speech: Scaling up end-to-end speech recognition

Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

Speaker adaptation of neural network acoustic models using i-vectors

This work proposes adapting deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. The adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.

Fast and accurate recurrent neural network acoustic models for speech recognition

We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed-forward deep neural networks (DNNs) as acoustic models for speech recognition.

Advances in speech transcription at IBM under the DARPA EARS program

This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program and presents results on English conversational telephony test data from the 2003 and 2004 NIST evaluations.