Semi-Supervised Training in Deep Learning Acoustic Model

Abstract

We studied semi-supervised training of a fully connected deep neural network (DNN), an unfolded recurrent neural network (RNN), and a long short-term memory recurrent neural network (LSTM-RNN) with respect to transcription quality, importance-based data sampling, and training data amount. We found that the DNN, unfolded RNN, and LSTM-RNN are increasingly sensitive to labeling errors. For example, with simulated erroneous training transcriptions at the 5%, 10%, or 15% word error rate (WER) level, the semi-supervised DNN yields a 2.37%, 4.84%, or 7.46% relative WER increase against the baseline model trained with perfect transcriptions; in comparison, the corresponding WER increase is 2.53%, 4.89%, or 8.85% for the unfolded RNN and 4.47%, 9.38%, or 14.01% for the LSTM-RNN. We further found that importance sampling has a similar impact on all three models, with a 2∼3% relative WER reduction compared to random sampling. Lastly, we compared the modeling capability with increased training data. Experimental results suggest that the LSTM-RNN benefits more from enlarged training data than the unfolded RNN and DNN. We trained a semi-supervised LSTM-RNN using 2,600 hours of transcribed and 10,100 hours of untranscribed data on a mobile speech task. The semi-supervised LSTM-RNN yields a 6.56% relative WER reduction against the supervised baseline.
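The abstract contrasts importance sampling with random sampling when choosing untranscribed data for semi-supervised training, but does not spell out the selection criterion. The sketch below illustrates one common approach, confidence-based selection, where the hypothesized (machine-generated) transcriptions with the highest decoder confidence are preferred; the function name, the `conf` field, and the hour budget are illustrative assumptions, not the paper's actual method.

```python
import random


def select_utterances(utts, budget_hours, strategy="importance"):
    """Select untranscribed utterances for semi-supervised training.

    Each utterance is a dict with 'dur' (duration in hours) and 'conf'
    (decoder confidence in its hypothesized transcription). This is a
    generic confidence-based sketch, not the paper's exact criterion.
    """
    if strategy == "importance":
        # Prefer utterances whose hypothesized labels are most reliable.
        pool = sorted(utts, key=lambda u: u["conf"], reverse=True)
    else:
        # Random-sampling baseline: shuffle a copy of the pool.
        pool = utts[:]
        random.shuffle(pool)

    chosen, total = [], 0.0
    for u in pool:
        if total + u["dur"] <= budget_hours:
            chosen.append(u)
            total += u["dur"]
    return chosen
```

With a fixed hour budget, the importance strategy fills the budget with the highest-confidence utterances first, which is one plausible way to realize the 2∼3% relative WER gain over random sampling reported above.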

DOI: 10.21437/Interspeech.2016-1596

4 Figures and Tables

Cite this paper

@inproceedings{Huang2016SemiSupervisedTI,
  title={Semi-Supervised Training in Deep Learning Acoustic Model},
  author={Yan Huang and Yongqiang Wang and Yifan Gong},
  booktitle={INTERSPEECH},
  year={2016}
}