Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training
@inproceedings{Miao2013ImprovingLC,
  title     = {Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training},
  author    = {Yajie Miao and Florian Metze},
  booktitle = {INTERSPEECH},
  year      = {2013}
}
We investigate two strategies to improve the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) in low-resource speech recognition. Although it outperforms the conventional Gaussian mixture model (GMM) HMM on various tasks, CD-DNN-HMM acoustic modeling becomes challenging with limited transcribed speech, e.g., less than 10 hours. To resolve this issue, we first exploit dropout, which prevents overfitting during DNN fine-tuning and improves model robustness under data sparseness…
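As a hedged illustration of the first strategy (a sketch of standard inverted dropout, not the authors' implementation): during fine-tuning, each hidden unit is zeroed with probability p and the survivors are rescaled by 1/(1-p), so the expected activation is unchanged and no rescaling is needed at decoding time.

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Inverted dropout on a hidden-layer activation matrix h.

    During training, each unit is zeroed with probability p and the
    survivors are scaled by 1/(1-p); at test time the layer is the
    identity, so no output rescaling is required.
    """
    if not train or p == 0.0:
        return h
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = (rng.random(h.shape) >= p).astype(h.dtype) / (1.0 - p)
    return h * mask

# Toy usage: a batch of 4 frames with 6 hidden units, all ones.
h = np.ones((4, 6))
h_train = dropout(h, p=0.5, train=True)   # entries are either 0.0 or 2.0
h_test = dropout(h, p=0.5, train=False)   # identity at test time
```

The random mask is redrawn for every minibatch in practice; the fixed seed here is only for reproducibility of the toy example.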
66 Citations
Deep maxout networks for low-resource speech recognition
- Computer Science2013 IEEE Workshop on Automatic Speech Recognition and Understanding
- 2013
This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks, and focuses on the particular advantage of DMNs under low-resource conditions with limited transcribed speech.
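For context (an illustrative sketch, not taken from the cited paper's code): a maxout unit outputs the maximum over a small group of linear "pieces," so the non-linearity is learned rather than fixed, which helps with limited training data.

```python
import numpy as np

def maxout(z, num_pieces):
    """Maxout activation.

    z has shape (batch, units * num_pieces); each output unit is the
    maximum over its group of num_pieces linear pieces.
    """
    batch, dim = z.shape
    assert dim % num_pieces == 0, "feature dim must be divisible by num_pieces"
    return z.reshape(batch, dim // num_pieces, num_pieces).max(axis=-1)

# Toy usage: 2 frames, 3 output units, 2 pieces per unit.
z = np.array([[1., 5., 2., 2., -1., 0.],
              [0., 0., 3., 1., 4., 4.]])
y = maxout(z, num_pieces=2)   # shape (2, 3)
```

With 2 pieces per unit, the first row [1, 5 | 2, 2 | -1, 0] reduces to [5, 2, 0].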
Investigation of different acoustic modeling techniques for low resource Indian language data
- Computer Science2015 Twenty First National Conference on Communications (NCC)
- 2015
This paper proposes to train a DNN containing a bottleneck layer in two stages, which shows improved performance compared to baseline SGMM and DNN models on limited training data.
Improving deep neural networks for LVCSR using dropout and shrinking structure
- Computer Science2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014
Two new methods are proposed to further improve the hybrid DNN/HMM model: using dropout as a pre-conditioner (DAP) to initialize the DNN prior to back-propagation (BP) for better recognition accuracy, and employing a shrinking DNN structure with hidden layers decreasing in size from bottom to top to reduce model size and computation time.
An empirical study of multilingual and low-resource spoken term detection using deep neural networks
- Computer ScienceINTERSPEECH
- 2014
The experimental results show that with the help of cross-lingual model transfer, the multilingual spoken term detection (STD) system can be substantially improved in low-resource settings, while the dropout method is less effective on the cross-lingual model transfer task.
Incorporating Context Information into Deep Neural Network Acoustic Models
- Computer Science
- 2015
This thesis proposes a framework to build cross-language DNNs via language-universal feature extractors (LUFEs), and presents a novel framework to perform feature-space SAT for DNN models, which can be naturally extended to other deep learning models such as CNNs.
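As a hedged sketch of the shared-hidden-layer idea underlying multilingual DNN training and language-universal feature extraction (illustrative layer sizes and ReLU activations are my assumptions, not from any cited system): the hidden stack is shared across languages, while each language keeps its own softmax output layer over its own senone set.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(d_in, d_out):
    # Small random weights; a real system would train these jointly.
    return rng.standard_normal((d_in, d_out)) * 0.01

# Shared hidden stack (illustrative sizes: 40-dim features, 2 hidden layers).
shared = [layer(40, 128), layer(128, 128)]

# One softmax output layer per language, over that language's senones
# (hypothetical senone counts of 500 and 300).
heads = {"lang_A": layer(128, 500), "lang_B": layer(128, 300)}

def forward(x, lang):
    """Propagate features through the shared hidden layers, then through
    the language-specific softmax head."""
    h = x
    for W in shared:
        h = np.maximum(h @ W, 0.0)          # shared hidden layers (ReLU here)
    logits = h @ heads[lang]                # language-specific output layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.standard_normal((4, 40))
pA = forward(x, "lang_A")   # shape (4, 500), rows sum to 1
pB = forward(x, "lang_B")   # shape (4, 300), rows sum to 1
```

In multilingual training the shared layers accumulate gradients from all languages, and for a new low-resource language only a fresh head (plus optional fine-tuning of the shared stack) is needed.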
Optimizing DNN Adaptation for Recognition of Enhanced Speech
- Computer ScienceINTERSPEECH
- 2017
This paper proposes to optimize the adaptation of the clean acoustic models towards the enhanced speech by tuning the regularization term based on the degree of enhancement, identifying an optimal setting for improving the speech recognition performance.
Multitask Learning of Deep Neural Networks for Low-Resource Speech Recognition
- Computer ScienceIEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2015
It is demonstrated that the performance of the phone models of a single low-resource language can be improved by training its grapheme models in parallel under the MTL framework, and the proposed MTL methods obtain significant word recognition gains.
Merging of Native and Non-native Speech for Low-resource Accented ASR
- Computer ScienceSLSP
- 2015
This paper presents our recent study on a low-resource automatic speech recognition (ASR) system with accented speech. We propose multi-accent Subspace Gaussian Mixture Models (SGMM) and accent-specific…
Towards speaker adaptive training of deep neural network acoustic models
- Computer ScienceINTERSPEECH
- 2014
Experiments show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features, respectively.
References
Showing 1-10 of 23 references
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
- Computer ScienceIEEE Transactions on Audio, Speech, and Language Processing
- 2012
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and which can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Exploiting sparseness in deep neural networks for large vocabulary speech recognition
- Computer Science2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2012
The goal of enforcing sparseness is formulated as soft regularization and convex constraint optimization problems, solutions under the stochastic gradient ascent setting are proposed, and novel data structures are proposed to exploit the random sparseness patterns to reduce model size and computation time.
Improving deep neural networks for LVCSR using rectified linear units and dropout
- Computer Science2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task, shows a 4.2% relative improvement over a DNN trained with sigmoid units and a 14.4% relative improvement over a strong GMM/HMM system.
Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR
- Computer Science2012 IEEE Spoken Language Technology Workshop (SLT)
- 2012
This work investigates the use of cross-lingual acoustic data to initialise deep neural network (DNN) acoustic models by means of unsupervised restricted Boltzmann machine (RBM) pre-training, and shows that unsupervised pre-training is more crucial for the hybrid setups, particularly with limited amounts of transcribed training data.
Extracting deep bottleneck features using stacked auto-encoders
- Computer Science2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
It is found that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available.
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
- Computer Science2011 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2011
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Subspace mixture model for low-resource speech recognition in cross-lingual settings
- Computer Science2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
This work investigates an extension to SGMM, referred to as the subspace mixture model (SMM), in which subspace parameters on the target language are cast as a linear mixture of the subspaces derived from source languages.
Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
- Computer ScienceICML
- 2012
Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network…
Regularized subspace Gaussian mixture models for cross-lingual speech recognition
- Computer Science2011 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2011
It is shown that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than those from a single source language, and that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WERs.
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
- Computer ScienceIEEE Transactions on Audio, Speech, and Language Processing
- 2007
This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.