• Corpus ID: 9443248

Semi-supervised maximum mutual information training of deep neural network acoustic models

Vimal Manohar, Daniel Povey, Sanjeev Khudanpur
Maximum Mutual Information (MMI) is a popular discriminative criterion that has been used in supervised training of acoustic models for automatic speech recognition. However, standard discriminative training is very sensitive to the accuracy of the transcription, and hence its implementation in a semi-supervised setting requires extensive filtering of data. We will show that if the supervision transcripts are not known, the natural analogue of MMI is to minimize the conditional entropy of the… 
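
As a rough illustration of the quantity the abstract names, the sketch below computes the entropy of a posterior distribution over competing hypotheses for one utterance. The function name `conditional_entropy` and the example posteriors are hypothetical and do not come from the paper; a real system would accumulate this quantity over lattices rather than over a short explicit list.

```python
import math


def conditional_entropy(path_posteriors):
    """Entropy (in nats) of a posterior distribution over hypotheses.

    `path_posteriors` is an assumed input format: non-negative scores,
    one per competing hypothesis, normalized here to sum to one.
    """
    if any(p < 0 for p in path_posteriors):
        raise ValueError("posteriors must be non-negative")
    total = sum(path_posteriors)
    if total <= 0:
        raise ValueError("posteriors must have positive total mass")

    entropy = 0.0
    for p in path_posteriors:
        q = p / total
        if q > 0:  # zero-probability hypotheses contribute nothing
            entropy -= q * math.log(q)
    return entropy


# A peaked posterior (the model is confident) has low entropy;
# a flat posterior (the model is unsure) has high entropy.
confident = conditional_entropy([0.97, 0.02, 0.01])
uncertain = conditional_entropy([0.4, 0.3, 0.3])
```

Under this view, lowering the entropy on untranscribed data pushes the model toward more confident hypotheses without requiring a reference transcript.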


Semi-supervised training for automatic speech recognition
It is shown that maximizing Negative Conditional Entropy (NCE) over lattices from unsupervised data, along with state-level Minimum Bayes Risk (sMBR) on supervised data, in a multi-task architecture gives word error rate (WER) improvements.
Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI
Various extensions to standard LF-MMI training are described that allow lattices obtained by decoding unsupervised data to be used as supervision; different methods for splitting the lattices and incorporating frame tolerances into the supervision FST are investigated.
Investigation of Semi-Supervised Acoustic Model Training Based on the Committee of Heterogeneous Neural Networks
The investigation in this paper focuses on a different approach that uses additional complementary AMs to form a committee that creates labels for untranscribed data, and on the case of intentionally excluding the primary seed AM from the committee; both could improve the chance of finding more informative training samples for the seed AM.
Semi-supervised training strategies for deep neural networks
  • M. Gibson, G. Cook, P. Zhan
  • Computer Science
    2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2017
Experimental results on the internal study dataset provide evidence that in a low-resource scenario the most effective semi-supervised training strategy is ‘naive CE’ (treating manually transcribed and automatically transcribed data identically during the cross entropy phase of training) followed by use of a shared hidden layer technique during sequence training.
Semi-supervised ensemble DNN acoustic model training
This work proposes effective semi-supervised training of deep neural network (DNN) acoustic models that incorporates the diversity among an ensemble of models; the resulting model improved performance on a lecture transcription task.
Semi-Supervised Acoustic Model Training by Discriminative Data Selection From Multiple ASR Systems’ Hypotheses
A semi-supervised training scheme that takes advantage of huge quantities of unlabeled video lecture archives, particularly for the deep neural network (DNN) acoustic model, is investigated and shows higher quality than the conventional system combination method (ROVER).
Active and Semi-Supervised Learning in ASR: Benefits on the Acoustic and Language Models
The results indicate that, while SST is crucial at the beginning of the labeling process, its gains degrade rapidly as AL is set in place, and the proposed approach improves the word error rate by about 12.5% relative.
Semi-Supervised Training in Deep Learning Acoustic Model
It is found that DNN, unfolded RNN, and LSTM-RNN are increasingly more sensitive to labeling errors, and that importance sampling has a similar impact on all three models, with 2-3% relative WER reduction compared to random sampling.
Lightly Supervised vs. Semi-supervised Training of Acoustic Model on Luxembourgish for Low-resource Automatic Speech Recognition
In this work, we focus on exploiting ‘inexpensive’ data in order to improve the DNN acoustic model for ASR. We explore two strategies: the first one uses untranscribed data from the target domain.
Semi-Supervised DNN Training with Word Selection for ASR
This paper examines the granularity of confidences (per-sentence, per-word, per-frame) and how the data should be used (data selection by masks, or mini-batch SGD with confidences as weights).


Semi-supervised training of Deep Neural Networks
It is beneficial to reduce the disproportion in amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to do a frame-selection based on per-frame confidences derived from confusion in a lattice.
Lightly supervised and unsupervised acoustic model training
Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data.
Semi-supervised DNN training in meeting recognition
This paper explores semi-supervised training of Deep Neural Networks (DNN) in a meeting recognition task and investigates two options available to reduce that: selecting data with fewer errors, and changing the dependence on noise by reducing label precision.
Unsupervised training and directed manual transcription for LVCSR
Multi-view and multi-objective semi-supervised learning for large vocabulary continuous speech recognition
This paper proposes SSL for LVCSR by using the multiple views learned from different acoustic features and randomized decision trees and develops the multi-objective learning of HMM-based acoustic models by optimizing a hybrid criterion established by the combination of the discriminative mutual information from labeled data and the entropy from unlabeled data.
Sequence-discriminative training of deep neural networks
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative on a standard 300-hour American conversational telephone speech task.
Improving deep neural network acoustic models using generalized maxout networks
This paper introduces two new types of generalized maxout units, called p-norm and soft-maxout, and presents a method to control instability when training unbounded-output nonlinearities.
Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications]
This paper identifies "reliable" regions in the transcript that can be used for training acoustic models in ASR systems and shows that discriminative training gives us word error rate (WER) reductions of 8-15% relative to the baseline.
Investigating Data Selection for Minimum Phone Error Training of Acoustic Models
A novel data selection approach based on the normalized frame-level entropy of Gaussian posterior probabilities obtained from the word lattice of the training utterance is explored; its merit is that it focuses the training algorithm on the statistics of frames that lie near the decision boundary, for better discrimination.
MMIE training of large vocabulary recognition systems