Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

@article{Dahl2012ContextDependentPD,
  title={Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition},
  author={George E. Dahl and Dong Yu and Li Deng and Alex Acero},
  journal={IEEE Transactions on Audio, Speech, and Language Processing},
  year={2012},
  volume={20},
  pages={30-42}
}
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks… 
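The key mechanism in the abstract is easy to state concretely: the DNN estimates senone posteriors p(s|x), and the decoder converts these to scaled likelihoods p(x|s) ∝ p(s|x)/p(s) before standard HMM search. Below is a minimal numpy sketch of that conversion; the layer shapes and function names are illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def senone_posteriors(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax over senones -> p(s|x)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return softmax(h @ weights[-1] + biases[-1])

def scaled_likelihoods(posteriors, senone_priors):
    """Hybrid decoding uses p(x|s) up to a constant factor: p(s|x) / p(s),
    with p(s) estimated from senone frame counts in the training alignment."""
    return posteriors / senone_priors
```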

Citations

Pipelined Back-Propagation for Context-Dependent Deep Neural Networks
TLDR
It is shown that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server.
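In pipelined back-propagation, each GPU owns a contiguous group of layers and works on a different micro-batch than its neighbors, so all cards stay busy at the cost of slightly delayed weight updates. A toy schedule simulation in Python (stage and micro-batch counts are illustrative, not from the paper):

```python
# Each "stage" is a group of layers pinned to one GPU. At clock tick t,
# stage s runs the forward pass for micro-batch t - s, so once the
# pipeline fills, all GPUs compute in parallel on different micro-batches.
num_stages, num_microbatches = 3, 6
for t in range(num_stages + num_microbatches - 1):
    busy = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_microbatches]
    print(f"tick {t}: " + ", ".join(f"GPU{s} fwd mb{mb}" for s, mb in busy))
# The backward pass streams through the stages in reverse, which is why
# the method only approximates BP: weight updates use slightly stale gradients.
```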
Standalone training of context-dependent deep neural network acoustic models
  • Chao Zhang, P. Woodland
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
This paper introduces a method for training state-of-the-art CD-DNN-HMMs without relying on a pre-existing GMM-HMM system, achieved in two steps: iteratively build a context-independent (CI) DNN from word transcriptions, then cluster the equivalent output distributions of the untied CD-HMM states using the decision-tree-based state-tying approach.
Context-dependent deep neural networks for commercial Mandarin speech recognition applications
TLDR
It is demonstrated that CD-DNN-HMMs achieve a relative 26% word error reduction in Baidu's short message (SMS) voice input application and a relative 16% sentence error reduction in its voice search application, compared with state-of-the-art CD-GMM-HMMs trained using fMPE.
Deep neural network acoustic modeling for native and non-native Mandarin speech recognition
  • Xin Chen, Jian Cheng
  • Computer Science
    The 9th International Symposium on Chinese Spoken Language Processing
  • 2012
TLDR
This paper applied CD-DNN-HMM acoustic models to an automatic spoken Chinese test, significantly improving speech recognition performance, and investigated accent adaptation using a Linear Input Network (LIN) on a ReLU DNN structure to enhance non-native speech recognition accuracy.
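A Linear Input Network is among the simplest adaptation schemes: a trainable affine transform, initialized to the identity, is prepended to a frozen speaker-independent DNN, and only the transform is estimated on the adaptation data. A hedged numpy sketch; the dimension and names are placeholders:

```python
import numpy as np

feat_dim = 40  # illustrative acoustic feature dimension

# LIN parameters: start at the identity mapping so that, before any
# adaptation updates, the system behaves exactly like the frozen DNN.
A = np.eye(feat_dim)
b = np.zeros(feat_dim)

def lin_forward(x, frozen_dnn):
    """Apply the speaker/accent-specific affine transform, then the
    frozen speaker-independent DNN; only A and b receive updates."""
    return frozen_dnn(x @ A.T + b)
```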
Factorized Deep Neural Networks for Adaptive Speech Recognition
TLDR
Two types of factorized adaptive DNNs are proposed and described, improving earlier versions of CD-DNN-HMMs, providing new ways of modeling speaker and environment factors, and offering insight into how environment-invariant DNN models may be constructed and subsequently trained.
KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition
TLDR
Experiments demonstrate that the proposed adaptation technique provides a 2%-30% relative error reduction over already very strong speaker-independent CD-DNN-HMM systems, using different adaptation sets under both supervised and unsupervised adaptation setups.
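The regularizer adds a KL term between the speaker-independent (SI) model's senone posteriors and the adapted model's, which is equivalent to cross-entropy training against targets interpolated between the hard labels and the frozen SI outputs. A hedged numpy sketch; the variable names and the choice of rho are illustrative:

```python
import numpy as np

def kl_regularized_targets(one_hot_labels, si_posteriors, rho=0.5):
    """KL-divergence regularized adaptation reduces to cross-entropy
    against interpolated targets; rho in [0, 1] controls how strongly
    the adapted model is tied to the speaker-independent one."""
    return (1.0 - rho) * one_hot_labels + rho * si_posteriors

def cross_entropy(targets, adapted_posteriors, eps=1e-12):
    return -(targets * np.log(adapted_posteriors + eps)).sum(axis=-1).mean()
```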
Hybrid context dependent CD-DNN-HMM Keyword Spotting (KWS) in speech conversations
  • V. Tyagi
  • Computer Science
    2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2016
TLDR
A filler-based hybrid DNN-HMM keyword spotting (KWS) system, to the authors' knowledge the first KWS architecture using context-dependent DNNs and HMMs, provides an attractive solution for near real-time KWS applications with high detection accuracy and a low false-alarm rate.
Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition
TLDR
This paper reports results for a DBN-pretrained context-dependent ANN/HMM system trained on two datasets much larger than any reported previously, and shows that it outperforms the best Gaussian mixture model hidden Markov model baseline.
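The "pretrained" part refers to stacking restricted Boltzmann machines trained greedily, usually with one-step contrastive divergence (CD-1). A minimal numpy sketch for a binary-binary RBM, with illustrative shapes and learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step for a binary-binary RBM.
    v0: (batch, n_vis) visible data; W: (n_vis, n_hid); a, b: biases."""
    h0_prob = sigmoid(v0 @ W + b)                      # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0   # sample hidden units
    v1_prob = sigmoid(h0 @ W.T + a)                    # reconstruct visibles
    h1_prob = sigmoid(v1_prob @ W + b)                 # negative phase
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b
```

Each trained RBM's hidden activations become the visible data for the next RBM in the stack, and the resulting weights initialize the DNN before supervised fine-tuning.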
Decision tree based state tying for speech recognition using DNN derived embeddings
  • Xiangang Li, Xihong Wu
  • Computer Science
    The 9th International Symposium on Chinese Spoken Language Processing
  • 2014
TLDR
A novel decision-tree-based state-tying procedure is proposed in which the state embeddings derived from a DNN are clustered to minimize the sum-of-squared error; the results demonstrate that the proposed DNN-based state-tying approach yields performance comparable to the GMM-based one.
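The clustering step can be pictured as minimizing sum-of-squared error between state embeddings and cluster centroids; plain k-means, sketched below, stands in for the paper's decision-tree-constrained search (which additionally restricts clusters by phonetic questions):

```python
import numpy as np

def kmeans_sse(embeddings, k, iters=20, seed=0):
    """Cluster untied-state embeddings under a sum-of-squared-error
    criterion; states assigned to the same cluster would share (tie)
    one output distribution."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = embeddings[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```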

References

Showing 1-10 of 94 references
Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition
TLDR
It is shown that pre-training can initialize weights to a point in the space where fine-tuning is effective, and is thus crucial both for training deep structured models and for the recognition performance of a CD-DBN-HMM-based large-vocabulary speech recognizer.
Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system
TLDR
A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities of the HMM/MLP speaker-independent continuous speech recognition system.
Large vocabulary continuous speech recognition with context-dependent DBN-HMMS
TLDR
This work proposes a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task.
CDNN: a context dependent neural network for continuous speech recognition
TLDR
It is shown how, without any simplifying assumptions, one can estimate likelihoods for context-dependent phonetic models with nets that are not substantially larger than context-independent MLPs.
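The factorization behind CDNN-style systems avoids needing one output unit per (phone, context) pair: p(q, c | x) = p(q | x) · p(c | x, q), estimated by a context-independent net and a context net conditioned on the phone identity. A small numpy sketch of combining the two outputs; array names are illustrative:

```python
import numpy as np

def context_dependent_posteriors(phone_post, context_post_given_phone):
    """Factorized CD posteriors: p(q, c | x) = p(q | x) * p(c | x, q).
    phone_post:               (frames, n_phones)
    context_post_given_phone: (frames, n_phones, n_contexts)
    returns:                  (frames, n_phones, n_contexts)"""
    return phone_post[:, :, None] * context_post_given_phone
```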
Investigation of full-sequence training of deep belief networks for speech recognition
TLDR
It is shown that DBNs learned using the sequence-based training criterion outperform those trained with the frame-based criterion for both three-layer and six-layer models, but the optimization procedure for the deeper DBN is more difficult under the sequence-based criterion.
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
TLDR
This work uses the mean-covariance restricted Boltzmann machine (mcRBM) to learn features of speech data that serve as input into a standard DBN, and achieves a phone error rate superior to all published results on speaker-independent TIMIT to date.
A segmental CRF approach to large vocabulary continuous speech recognition
  • G. Zweig, P. Nguyen
  • Computer Science
    2009 IEEE Workshop on Automatic Speech Recognition & Understanding
  • 2009
TLDR
A segmental conditional random field framework for large vocabulary continuous speech recognition is proposed that allows joint or separate discriminative training of the acoustic and language models.
Speech Recognition Using Augmented Conditional Random Fields
TLDR
A new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed, which addresses some limitations of HMMs while maintaining many of the aspects which have made them successful.
Deep Belief Networks for phone recognition
TLDR
Deep Belief Networks (DBNs) have recently proved to be very effective in a variety of machine learning problems, and this paper applies DBNs to acoustic modeling.
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
TLDR
This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on the minimum classification error (MCE) criterion applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.
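MCE makes the discrete classification error differentiable: a misclassification measure compares the correct class's discriminant score against a soft maximum over competitor scores, and a sigmoid then smooths the 0/1 error count. A hedged numpy sketch; eta and gamma are illustrative smoothing constants:

```python
import numpy as np

def mce_loss(scores, label, eta=1.0, gamma=1.0):
    """Smoothed minimum classification error loss for one token.
    scores: discriminant g_j per class; label: index of correct class."""
    g_correct = scores[label]
    competitors = np.delete(scores, label)
    soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = soft_max - g_correct                 # > 0 roughly means misclassified
    return 1.0 / (1.0 + np.exp(-gamma * d))  # sigmoid-smoothed error
```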