Convolutional Neural Networks for Speech Recognition

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, Dong Yu
IEEE/ACM Transactions on Audio, Speech, and Language Processing

Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. [...] We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features.

Convolutional Neural Network and Feature Transformation for Distant Speech Recognition

It is argued that transforming features could produce more discriminative features for CNN, and hence improve the robustness of speech recognition against reverberation.

Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition

Results are presented for combining CNNs and conventional RNNs with gate, highway, and residual networks to reduce the above problems, and the optimal neural network structures and training strategies for the proposed models are explored.

Performance Evaluation of Deep Convolutional Maxout Neural Network in Speech Recognition

The experimental results show that the combined model (CMDNN) improves the performance of ANNs in speech recognition by about 3% compared with pre-trained fully connected NNs with sigmoid neurons.

Automatic Speech Recognition Using Deep Neural Networks: New Possibilities

This dissertation proposes to use the CNN in a way that applies convolution and pooling operations along frequency to handle frequency variations that commonly happen due to speaker and pronunciation differences in speech signals.
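
The idea of convolving and pooling along frequency can be illustrated with a minimal sketch. It is not the dissertation's implementation: the 10-band frame, the 3-tap kernel, and the pooling size of 4 are made-up values, and the function names are hypothetical. The sketch shows that when a spectral peak moves by one band, the strongest filter response can still fall in the same pooling window.

```python
def conv1d_freq(bands, kernel, bias=0.0):
    """Slide one filter along the frequency axis of a single frame (valid mode)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, bands[i:i + k])) + bias
        for i in range(len(bands) - k + 1)
    ]


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical filter-bank energies for one frame, with a peak around band 3.
frame = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# The same frame with the peak moved up by one band (e.g. a different speaker).
shifted = [0.0] + frame[:-1]
kernel = [1.0, 2.0, 1.0]  # illustrative peak detector, not a learned filter

pooled = max_pool(conv1d_freq(frame, kernel), 4)
pooled_shifted = max_pool(conv1d_freq(shifted, kernel), 4)

print(pooled)          # [8.0, 1.0]
print(pooled_shifted)  # [8.0, 5.0]
```

The first pooled value is identical for both frames, which is the kind of tolerance to small frequency shifts that pooling is meant to provide. The invariance is limited: a shift larger than the pooling window would move the peak into a different pooled unit.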

Noise robust speech recognition using recent developments in neural networks for computer vision

This paper considers two approaches recently developed for image classification and examines their impacts on noisy speech recognition performance, including the use of a Parametric Rectified Linear Unit (PReLU).
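
For reference, a PReLU passes positive inputs unchanged and scales negative inputs by a learned slope. The sketch below is a scalar illustration only: the slope value 0.25 is an assumed starting point, and the function names are hypothetical rather than taken from the paper.

```python
def prelu(x, a):
    """Parametric ReLU: identity for positive x, slope a for the rest."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to the slope a."""
    return 0.0 if x > 0 else x


a = 0.25  # assumed initial slope; in training this value is updated by backpropagation
outputs = [prelu(x, a) for x in [-2.0, 0.0, 3.0]]
print(outputs)  # [-0.5, 0.0, 3.0]
```

With the slope fixed at 0 this reduces to the ordinary ReLU, and with a small fixed constant it reduces to a leaky ReLU. The PReLU differs in that the slope is a trainable parameter.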

Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

The proposed very deep CNNs can significantly reduce word error rate (WER) for noise robust speech recognition and are competitive with the long short-term memory recurrent neural networks (LSTM-RNN) acoustic model.

An analysis of convolutional neural networks for speech recognition

  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically. It is also established that the CNN structure combined with maxout units is the most effective model under small-size constraints, for the purpose of deploying small-footprint models to devices.

Deep Residual Networks with Auditory Inspired Features for Robust Speech Recognition

A Deep Residual Network architecture is proposed, allowing ResNets to be used in speech recognition tasks where the network input is small in comparison with the image dimensions for which they were initially designed, and a modification of the well-known Power Normalized Cepstral Coefficients as input to the ResNet is introduced with the aim of creating a noise invariant representation of the acoustic space.

Recurrent convolutional neural network for speech processing

A recently developed deep learning model, the recurrent convolutional neural network (RCNN), is proposed for speech processing. It inherits some merits of recurrent neural networks (RNN) and convolutional neural networks (CNN) and is competitive with previous methods in terms of accuracy and efficiency.



Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

The proposed CNN architecture is applied to speech recognition within the framework of hybrid NN-HMM model to use local filtering and max-pooling in frequency domain to normalize speaker variance to achieve higher multi-speaker speech recognition performance.

Exploring convolutional neural network structures and optimization techniques for speech recognition

This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the size in the pooled layer can be automatically learned.
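
The difference between full and limited weight sharing can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: the sections here do not overlap, each section has a single hand-picked 2-tap filter, and all names and values are hypothetical. Under full weight sharing, one filter is applied at every frequency position. Under limited weight sharing, each frequency section has its own filter, and pooling is done within the section.

```python
def correlate(bands, kernel):
    """Valid 1-D correlation along the frequency axis."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, bands[i:i + k]))
        for i in range(len(bands) - k + 1)
    ]


def full_weight_sharing(bands, kernel):
    """One filter shared across the whole frequency axis."""
    return correlate(bands, kernel)


def limited_weight_sharing(bands, section_kernels, section_size):
    """One filter per frequency section, max-pooled within each section."""
    pooled = []
    for s, kernel in enumerate(section_kernels):
        section = bands[s * section_size:(s + 1) * section_size]
        pooled.append(max(correlate(section, kernel)))
    return pooled


# Hypothetical frame: energy rises in the low bands and falls in the high bands.
bands = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
rising = [-1.0, 1.0]   # responds to an upward slope
falling = [1.0, -1.0]  # responds to a downward slope

fws = full_weight_sharing(bands, rising)
lws = limited_weight_sharing(bands, [rising, falling], 4)

print(fws)  # [1.0, 1.0, 1.0, 0.0, -1.0, -1.0, -1.0]
print(lws)  # [1.0, 1.0]
```

The shared filter responds positively only in the low bands, while the section-specific filters each match the local pattern of their own band. This mirrors the motivation usually given for limited weight sharing: spectral patterns in speech differ between low and high frequencies.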

Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks.

This paper argues that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals, and shows that these representations become increasingly insensitive to small perturbations in the input with increasing network depth.

Deep convolutional neural networks for LVCSR

This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.

Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM

This paper presents the strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility of using arbitrary features.

A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion

We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.
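
A distribution over senones is typically produced by a softmax output layer. The sketch below assumes three senones with made-up activations and priors. It also shows the common hybrid-system step of dividing posteriors by state priors to obtain scaled likelihoods for HMM decoding, which is stated here as an assumption about usual practice, not as a detail taken from this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over output-layer activations."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_likelihoods(posteriors, priors):
    """Divide senone posteriors by senone priors (assumed hybrid decoding step)."""
    return [p / q for p, q in zip(posteriors, priors)]


logits = [2.0, 1.0, 0.1]   # hypothetical activations for three senones
priors = [0.5, 0.3, 0.2]   # hypothetical senone priors from training alignments

posteriors = softmax(logits)
scaled = scaled_likelihoods(posteriors, priors)
```

The posteriors sum to one by construction. The scaled likelihoods do not, since they stand in for the emission probabilities of the HMM states up to a constant factor.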

Incoherent training of deep neural networks to de-correlate bottleneck features for speech recognition

This paper proposes two novel incoherent training methods that explicitly de-correlate BN features during DNN learning, and reports that they consistently surpass the state-of-the-art DNN/HMMs in all evaluated tasks.

Deep Belief Networks using discriminative features for phone recognition

Deep Belief Networks work even better when their inputs are speaker adaptive, discriminative features, and on the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model.

Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMS in acoustic modeling

This paper investigates DNN for several large vocabulary speech recognition tasks and proposes a few ideas to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN normalized features in DNN.