Corpus ID: 785223

Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training

@inproceedings{Miao2013ImprovingLC,
  title={Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training},
  author={Yajie Miao and Florian Metze},
  booktitle={INTERSPEECH},
  year={2013}
}
We investigate two strategies to improve the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) in low-resource speech recognition. Although outperforming the conventional Gaussian mixture model (GMM) HMM on various tasks, CD-DNN-HMM acoustic modeling becomes challenging with limited transcribed speech, e.g., less than 10 hours. To resolve this issue, we firstly exploit dropout which prevents overfitting in DNN finetuning and improves model robustness under data sparseness… 
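The first strategy, dropout, randomly zeroes hidden units during fine-tuning so the network cannot co-adapt to the small training set. A minimal sketch of the standard "inverted dropout" formulation (the function name and NumPy usage here are illustrative, not taken from the paper):

```python
import numpy as np

def dropout_forward(h, p_drop, rng, train=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and scale survivors by 1/(1 - p_drop), so the expected
    activation matches test time and no rescaling is needed at decoding."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones((4, 8))           # a batch of hidden activations
out = dropout_forward(h, 0.5, rng)   # entries are either 0.0 or 2.0
```

With p_drop = 0.5 each surviving unit is doubled, which is why the expected value of every activation is unchanged; at test time (train=False) the layer is the identity.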


Maxout networks for low-resource speech recognition
TLDR
This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks, and focuses on the particular advantage of DMNs under low-resource conditions with limited transcribed speech.
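A maxout unit, as referenced above, outputs the maximum over a group of k linear pre-activations instead of applying a fixed non-linearity. A minimal illustrative sketch (names and shapes are hypothetical, not from the cited paper):

```python
import numpy as np

def maxout(z, k):
    """Maxout activation: z has shape (batch, k * units); each output unit
    is the max over its group of k pre-activations."""
    b, d = z.shape
    assert d % k == 0, "pre-activation width must be divisible by group size k"
    return z.reshape(b, d // k, k).max(axis=2)

z = np.array([[1.0, 3.0, 2.0, 0.0]])  # one example, 2 units with k=2
y = maxout(z, k=2)                    # → [[3.0, 2.0]]
```

Because the activation is learned per group rather than fixed, maxout pairs naturally with dropout, which is part of its appeal under low-resource conditions.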
Investigation of different acoustic modeling techniques for low resource Indian language data
TLDR
This paper proposes to train a DNN containing a bottleneck layer in two stages, which shows improved performance compared to baseline SGMM and DNN models with limited training data.
Improving deep neural networks for LVCSR using dropout and shrinking structure
TLDR
Two new methods to further improve the hybrid DNN/HMM model are proposed: using dropout as a pre-conditioner (DAP) to initialize the DNN prior to back-propagation (BP) for better recognition accuracy, and employing a shrinking DNN structure with hidden layers decreasing in size from bottom to top to reduce model size and computation time.
An empirical study of multilingual and low-resource spoken term detection using deep neural networks
TLDR
The experimental results show that, with the help of cross-lingual model transfer, the multilingual spoken term detection (STD) system can be substantially improved in low-resource settings, and that the dropout method is not as effective on the cross-lingual model transfer task.
Incorporating Context Information into Deep Neural Network Acoustic Models
TLDR
This thesis proposes a framework to build cross-language DNNs via language-universal feature extractors (LUFEs), and presents a novel framework to perform feature-space SAT for DNN models, which can be naturally extended to other deep learning models such as CNNs.
Optimizing DNN Adaptation for Recognition of Enhanced Speech
TLDR
This paper proposes to optimize the adaptation of the clean acoustic models towards the enhanced speech by tuning the regularization term based on the degree of enhancement, identifying an optimal setting for improving the speech recognition performance.
Multitask Learning of Deep Neural Networks for Low-Resource Speech Recognition
  • Dongpeng Chen, B. Mak
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2015
TLDR
It is demonstrated that the performance of the phone models of a single low-resource language can be improved by training its grapheme models in parallel under the MTL framework, and the proposed MTL methods obtain significant word recognition gains.
Merging of Native and Non-native Speech for Low-resource Accented ASR
This paper presents our recent study on a low-resource automatic speech recognition (ASR) system with accented speech. We propose multi-accent Subspace Gaussian Mixture Models (SGMM) and accent-specific
Towards speaker adaptive training of deep neural network acoustic models
TLDR
Experiments show that, compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features, respectively.

References

SHOWING 1-10 OF 23 REFERENCES
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and shows that it can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Exploiting sparseness in deep neural networks for large vocabulary speech recognition
TLDR
The goal of enforcing sparseness is formulated as soft regularization and convex constraint optimization problems, solutions under the stochastic gradient ascent setting are proposed, and novel data structures are proposed to exploit the random sparseness patterns to reduce model size and computation time.
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task, shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR
TLDR
This work investigates the use of cross-lingual acoustic data to initialise deep neural network (DNN) acoustic models by means of unsupervised restricted Boltzmann machine (RBM) pre-training, and shows that unsupervised pre-training is more crucial for the hybrid setups, particularly with limited amounts of transcribed training data.
Extracting deep bottleneck features using stacked auto-encoders
TLDR
It is found that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available.
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
TLDR
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Subspace mixture model for low-resource speech recognition in cross-lingual settings
TLDR
This work investigates an extension to SGMM, referred to as the subspace mixture model (SMM), in which subspace parameters on the target language are cast as a linear mixture of the subspaces derived from source languages.
Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network
Regularized subspace Gaussian mixture models for cross-lingual speech recognition
TLDR
It is shown that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than using those from a single source language, and that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WERs.
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
TLDR
This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.