Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Gerald Penn
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Convolutional Neural Networks (CNNs) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNNs to speech recognition within the framework of the hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance to achieve higher multi-speaker speech recognition performance… 
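The idea of local filtering followed by max-pooling along the frequency axis can be illustrated with a toy sketch. This is not the authors' architecture: the 10-band frame, the 3-band filter weights, and the pool size of 4 are hypothetical values chosen only to show how a small shift of a spectral peak (standing in for a speaker difference) can disappear after pooling.

```python
def conv_along_frequency(frame, kernel):
    """Slide a local filter across the frequency bands of one frame
    (valid cross-correlation, no padding)."""
    width = len(kernel)
    return [
        sum(frame[start + k] * kernel[k] for k in range(width))
        for start in range(len(frame) - width + 1)
    ]


def max_pool(activations, pool_size):
    """Non-overlapping max-pooling over adjacent frequency positions."""
    return [
        max(activations[i:i + pool_size])
        for i in range(0, len(activations) - pool_size + 1, pool_size)
    ]


# Hypothetical band energies with a single spectral peak at band 2.
frame = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
# The same peak shifted up by one band (assumed stand-in for speaker variation).
shifted = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical local filter spanning three neighbouring bands.
kernel = [0.5, 1.0, 0.5]

conv = conv_along_frequency(frame, kernel)
conv_shifted = conv_along_frequency(shifted, kernel)

pooled = max_pool(conv, pool_size=4)
pooled_shifted = max_pool(conv_shifted, pool_size=4)

print(conv != conv_shifted)      # filter outputs move with the peak
print(pooled == pooled_shifted)  # pooled outputs agree for this shift
```

The invariance in this sketch holds only while the shifted peak stays inside the same pooling region; a larger shift would move the maximum into a neighbouring pooled unit.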

Convolutional Neural Networks for Speech Recognition
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Exploring convolutional neural network structures and optimization techniques for speech recognition
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the size in the pooled layer can be automatically learned.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
This paper proposes an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections, and argues that CNNs have the capability to model temporal correlations with appropriate context information.
Convolutional Neural Network and Feature Transformation for Distant Speech Recognition
It is argued that transforming features could produce more discriminative features for CNN, and hence improve the robustness of speech recognition against reverberation.
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition
The proposed very deep CNNs can significantly reduce word error rate (WER) for noise robust speech recognition and are competitive with the long short-term memory recurrent neural networks (LSTM-RNN) acoustic model.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while
Application of convolutional neural networks to speaker recognition in noisy conditions
This paper applies a convolutional neural network (CNN) trained for automatic speech recognition (ASR) to the task of speaker identification (SID). In the CNN/i-vector front end, the sufficient
Automatic Speech Recognition Using Deep Neural Networks: New Possibilities
This dissertation proposes to use the CNN in a way that applies convolution and pooling operations along frequency to handle frequency variations that commonly happen due to speaker and pronunciation differences in speech signals.
A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
A hybrid CNN-BLSTM architecture is proposed to exploit the spatial and temporal properties of the speech signal, to improve continuous speech recognition, and to overcome another shortcoming of CNNs, namely that speaker-adapted features cannot be modeled directly in a CNN.
Convolutional Neural Networks-based continuous speech recognition using raw speech signal
The studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with as many parameters and that the features learned from raw speech by the CNN-based approach could generalize across different databases.


Making Deep Belief Networks effective for large vocabulary continuous speech recognition
This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task.
Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
This work uses the mean-covariance restricted Boltzmann machine (mcRBM) to learn features of speech data that serve as input into a standard DBN, and achieves a phone error rate superior to all published results on speaker-independent TIMIT to date.
Acoustic Modeling Using Deep Belief Networks
It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning
Gradient-based learning applied to document recognition
This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.
Gradient-Based Learning Applied to Document Recognition
Various methods applied to handwritten character recognition are reviewed and compared, and convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques.
Maximum likelihood linear transformations for HMM-based speech recognition
M. Gales · Computer Science · Comput. Speech Lang. · 1998
The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.
Speaker-independent phone recognition using hidden Markov models
Kai-Fu Lee, H. Hon · Computer Science · IEEE Trans. Acoust. Speech Signal Process. · 1989
The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data; the reported results can be used as benchmarks to evaluate future systems.
Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network
Learning methods for generic object recognition with invariance to pose and lighting
Yann LeCun, F. Huang, L. Bottou · Computer Science · Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004) · 2004
On a segmentation/recognition task with highly cluttered images, SVMs proved impractical, while convolutional nets yielded 16/7% error. A real-time version of the system was implemented that can detect and classify objects in natural scenes at around 10 frames per second.