Acoustic Modeling Using Deep Belief Networks

@article{Mohamed2012AcousticMU,
  title={Acoustic Modeling Using Deep Belief Networks},
  author={Abdel-rahman Mohamed and George E. Dahl and Geoffrey E. Hinton},
  journal={IEEE Transactions on Audio, Speech, and Language Processing},
  year={2012},
  volume={20},
  pages={14--22}
}
Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any… 
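The generative pre-training mentioned in the abstract can be sketched as a stack of binary restricted Boltzmann machines (RBMs), each trained greedily with one-step contrastive divergence (CD-1) on the activations of the layer below. The sketch below is illustrative only: the layer sizes, learning rate, epoch count, and binary toy inputs are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=20):
    """Train one binary RBM with CD-1; return (weights, hidden bias, visible bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        # positive phase: hidden-unit probabilities given the data
        h_prob = sigmoid(data @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # negative phase: one reconstruction step (CD-1)
        v_recon = sigmoid(h_sample @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # approximate gradient: data statistics minus reconstruction statistics
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
        b_v += lr * (data - v_recon).mean(axis=0)
    return W, b_h, b_v

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM models the previous layer's output."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_h, _ = train_rbm(x, n_hidden)
        layers.append((W, b_h))
        x = sigmoid(x @ W + b_h)  # deterministic up-pass feeds the next RBM
    return layers

# toy stand-in for a window of spectral feature vectors: 200 binary frames, dim 30
frames = (rng.random((200, 30)) < 0.3).astype(float)
stack = pretrain_stack(frames, [50, 50])
```

In the paper's setting, the pre-trained weights would then initialize a feed-forward network that is fine-tuned discriminatively to predict HMM states; the sketch stops at the generative stage.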

Deep Belief Networks using discriminative features for phone recognition

TLDR
Deep Belief Networks work even better when their inputs are speaker adaptive, discriminative features, and on the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model.

A critical examination of deep learning approaches to automated speech recognition

TLDR
The aim is to study the levels of representation for speech acoustic features produced by the hidden layers of DBNs, estimating phoneme recognition error and using classification accuracy from Support Vector Machines as a measure of separability between the DBN representations of 61 phoneme classes.

Understanding how Deep Belief Networks perform acoustic modelling

TLDR
This paper illustrates how each of these three aspects contributes to the DBN's good recognition performance, using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales.

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

Implementation of DNN-HMM Acoustic Models for Phoneme Recognition

TLDR
This thesis aims to empirically confirm the capability of DNNs to outperform GMMs in acoustic modeling, and provides a systematic procedure to implement DNN-HMM acoustic models for phoneme recognition, including the implementation of a GMM-HMM baseline system.

Speech recognition features based on deep latent Gaussian models

This paper constructs speech features based on a generative model using a deep latent Gaussian model (DLGM), which is trained using the stochastic gradient variational Bayes (SGVB) algorithm and performs…

Acoustic Modeling Based on Deep Conditional Random Fields

TLDR
A phone recognition task based on DCRFs is formulated using the maximum entropy (MaxEnt) principle, and preliminary results on the TIMIT task show that DCRFs can lead to good results.

Restructuring output layers of deep neural networks using minimum risk parameter clustering

TLDR
This paper attempts to optimize the topology of hidden Markov models (HMMs) for automatic speech recognition by introducing discriminative optimization with discrete constraints that force parameters to be tied across states.

Modular combination of deep neural networks for acoustic modeling

TLDR
It is shown that bottleneck features improve the recognition performance of DBN/HMM hybrids, and that the modular combination enables the acoustic model to benefit from a larger temporal context.
...

References

SHOWING 1-10 OF 47 REFERENCES

Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

TLDR
This work uses the mean-covariance restricted Boltzmann machine (mcRBM) to learn features of speech data that serve as input into a standard DBN, and achieves a phone error rate superior to all published results on speaker-independent TIMIT to date.

Investigation of full-sequence training of deep belief networks for speech recognition

TLDR
It is shown that DBNs learned with the sequence-based training criterion outperform those learned with the frame-based criterion for both three-layer and six-layer models, although the optimization procedure for the deeper DBN is more difficult under the sequence-based criterion.

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.

Speech Recognition Using Augmented Conditional Random Fields

TLDR
A new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed, which addresses some limitations of HMMs while maintaining many of the aspects which have made them successful.

Speaker-independent phone recognition using hidden Markov models

  • Kai-Fu Lee, H. Hon
  • Computer Science
    IEEE Trans. Acoust. Speech Signal Process.
  • 1989
TLDR
The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data, and their results can serve as benchmarks to evaluate future systems.

Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition

  • Fei Sha, L. Saul
  • Computer Science
    2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
  • 2006
TLDR
A framework is developed for large margin classification by Gaussian mixture models (GMMs), which have many parallels to support vector machines (SVMs) but use ellipsoids rather than half-spaces to model classes.

A Fast Learning Algorithm for Deep Belief Nets

TLDR
A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition

  • L. Deng, Dong Yu
  • Physics, Computer Science
    2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007
TLDR
The earlier version of the hidden trajectory model (HTM) for speech dynamics, which predicts "static" cepstra as the observed acoustic features, is generalized to a joint static/delta-cepstra HTM, enabling efficient computation of the joint likelihood of both static and delta cepstral sequences given the model.

The acoustic-modeling problem in automatic speech recognition

TLDR
This thesis is primarily concerned with the use of hidden Markov models to model sequences of feature vectors which lie in a continuous space such as R sub N and explores the trade-off between packing a lot of information into such sequences and being able to model them accurately.

Structured speech modeling

TLDR
This paper shows how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects and demonstrates superior recognizer performance over a modern hidden Markov model-based system.