Corpus ID: 8079975

Exploring convolutional neural network structures and optimization techniques for speech recognition

Authors: Ossama Abdel-Hamid, Li Deng, Dong Yu
Recently, convolutional neural networks (CNNs) have been shown to outperform standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. We first investigate several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers.
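The distinction between full and limited weight sharing mentioned in the abstract can be illustrated with a minimal NumPy sketch (hypothetical helper names; 1-D convolution along the frequency axis only, no training, and not the paper's actual implementation):

```python
import numpy as np

def conv_full_sharing(spec, kernel):
    """1-D convolution along the frequency axis with full weight sharing:
    the same kernel slides over every frequency position."""
    n_freq, k = spec.shape[0], kernel.shape[0]
    out = np.empty(n_freq - k + 1)
    for i in range(n_freq - k + 1):
        out[i] = np.dot(spec[i:i + k], kernel)
    return out

def conv_limited_sharing(spec, kernels, section_size):
    """Limited weight sharing: the frequency axis is split into sections,
    and each section has its own kernel (weights shared only within a section)."""
    n_freq, k = spec.shape[0], kernels.shape[1]
    out = []
    for s, kern in enumerate(kernels):
        start = s * section_size
        stop = min(start + section_size, n_freq - k + 1)
        for i in range(start, stop):
            out.append(np.dot(spec[i:i + k], kern))
    return np.array(out)
```

Full sharing assumes a spectral pattern means the same thing at every frequency; limited sharing relaxes this, since low- and high-frequency bands of speech behave differently.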


Phone recognition with hierarchical convolutional deep maxout networks
  • L. Tóth
  • Computer Science
    EURASIP J. Audio Speech Music. Process.
  • 2015
It is shown that, with the hierarchical modelling approach, the CNN can reduce the error rate by exploiting an expanded input context, and that all the proposed modelling improvements give consistently better results on this larger database as well.
Convolutional Neural Networks for Speech Recognition
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Convolutional deep maxout networks for phone recognition
Phone recognition tests on the TIMIT database show that switching to maxout units from rectifier units decreases the phone error rate for each network configuration studied, and yields relative error rate reductions of between 2% and 6%.
Exploiting Depth and Highway Connections in Convolutional Recurrent Deep Neural Networks for Speech Recognition
The CLDNN model is extended by introducing a highway connection between LSTM layers, which enables direct information flow from cells of lower layers to cells of upper layers, and is able to better exploit the advantages of a deeper structure.
Performance Evaluation of Deep Convolutional Maxout Neural Network in Speech Recognition
The results obtained from the experiments show that the combined model (CMDNN) improves the performance of ANNs in speech recognition by about 3% versus pre-trained fully connected NNs with sigmoid neurons.
A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
A hybrid CNN-BLSTM architecture is proposed to appropriately exploit the spatial and temporal properties of the speech signal, to improve continuous speech recognition, and to overcome another shortcoming of CNNs: speaker-adapted features cannot be modeled directly in a CNN.
Recurrent convolutional neural network for speech processing
A recently developed deep learning model, the recurrent convolutional neural network (RCNN), is proposed for speech processing; it inherits some merits of recurrent neural networks (RNN) and convolutional neural networks (CNN) and is competitive with previous methods in terms of accuracy and efficiency.
Towards Robust Combined Deep Architecture for Speech Recognition: Experiments on TIMIT
This paper proposes to combine CNN, GRU-RNN and DNN in a single deep architecture called Convolutional Gated Recurrent Unit, Deep Neural Network (CGDNN).
Convolutional Neural Networks for Distant Speech Recognition
This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
Spoken Letter Recognition using Deep Convolutional Neural Networks on Sparse and Dissimilar Data
Transfer learning is applied to DCNNs for spoken letter recognition, even though the target data is very dissimilar from the source data, demonstrating the range of application for transfer learning.

References
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
The proposed CNN architecture is applied to speech recognition within the hybrid NN-HMM framework, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
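The effect of max-pooling along the frequency axis (collapsing small, speaker-dependent frequency shifts) can be sketched as follows (hypothetical helper name; a minimal NumPy illustration, not the paper's implementation):

```python
import numpy as np

def max_pool_freq(feature_map, pool_size):
    """Non-overlapping max-pooling along the frequency axis: within each
    pool, a small shift of the peak frequency leaves the output unchanged."""
    n = (len(feature_map) // pool_size) * pool_size  # drop any remainder bins
    return feature_map[:n].reshape(-1, pool_size).max(axis=1)
```

Two filter-bank vectors that differ only by a one-bin frequency shift within a pool produce the same pooled output, which is how pooling normalizes part of the speaker variance.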
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing the speech-class confusion induced by such invariance.
The Deep Tensor Neural Network With Applications to Large Vocabulary Speech Recognition
Evaluation on Switchboard tasks indicates that DTNNs can outperform the already high-performing DNNs with 4-5% and 3% relative word error reduction, respectively, using 30-hr and 309-hr training sets.
Speech Recognition Using Long-Span Temporal Patterns in a Deep Network Model
It is shown that word recognition accuracy can be significantly enhanced by arranging DNNs in a hierarchical structure to model long-term energy trajectories and evaluated on the 5000-word Wall Street Journal task.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data; however, to our knowledge, these approaches have not been extensively studied for auditory data.
Acoustic Modeling Using Deep Belief Networks
It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Making Deep Belief Networks effective for large vocabulary continuous speech recognition
This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task.
Gradient-based learning applied to document recognition
This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.