An analysis of convolutional neural networks for speech recognition

@article{Huang2015AnAO,
  title={An analysis of convolutional neural networks for speech recognition},
  author={Jui Ting Huang and Jinyu Li and Yifan Gong},
  journal={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2015},
  pages={4989-4993}
}
  • J. Huang, Jinyu Li, Y. Gong
  • Published 19 April 2015
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis of why CNNs perform well and in which cases we should expect CNNs to have an advantage. […] We then identify four domains in which we think CNNs can consistently provide advantages over fully-connected deep neural networks (DNNs): channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. For distant speech…
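The core comparison in the paper is between fully-connected DNNs, which treat the log-mel input window as a flat vector, and CNNs, whose filters are shared along the frequency axis and followed by pooling. The PyTorch sketch below is only an illustrative toy contrasting the two input treatments; the layer sizes, the 40-mel × 11-frame window, activation choices, and senone count are assumptions, not the architectures evaluated in the paper.

```python
# Illustrative sketch (not the paper's exact architectures): a fully-connected
# DNN versus a small CNN acoustic model over a 40-mel x 11-frame input window.
# All layer sizes and hyperparameters here are assumptions.
import torch
import torch.nn as nn

N_MELS, N_FRAMES, N_STATES = 40, 11, 2000   # assumed input window and senone count

dnn = nn.Sequential(                        # flat vector input: frequency locality is discarded
    nn.Flatten(),
    nn.Linear(N_MELS * N_FRAMES, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, N_STATES),
)

cnn = nn.Sequential(                        # filters convolved along the frequency axis
    nn.Conv2d(1, 64, kernel_size=(8, 1)),   # 8-mel-wide filters, shared across frequency
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(3, 1)),       # pooling gives tolerance to frequency shifts
    nn.Flatten(),
    nn.LazyLinear(1024), nn.ReLU(),
    nn.Linear(1024, N_STATES),
)

x = torch.randn(8, 1, N_MELS, N_FRAMES)     # batch of 8 log-mel windows
print(dnn(x).shape, cnn(x).shape)           # both: torch.Size([8, 2000])
```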

Citations

Phone recognition with hierarchical convolutional deep maxout networks
  • L. Tóth
  • Computer Science
    EURASIP J. Audio Speech Music. Process.
  • 2015
TLDR
It is shown that with the hierarchical modelling approach the CNN can reduce the error rate of a network that sees an expanded input context, and all of the proposed modelling improvements are found to give consistently better results on this larger database as well.
Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
TLDR
Contributions that combine CNNs and conventional RNNs with gate, highway, and residual networks to address these issues are presented, and the optimal neural network structures and training strategies for the proposed models are explored.
Noise robust speech recognition using recent developments in neural networks for computer vision
TLDR
This paper considers two approaches recently developed for image classification, including the use of the Parametric Rectified Linear Unit (PReLU), and examines their impact on noisy speech recognition performance.
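PReLU generalises the rectifier by learning the slope of the negative branch alongside the network weights. The NumPy sketch below is a minimal illustration of the activation and of the gradient used to learn that slope; the example slope value is arbitrary and not taken from the paper.

```python
# Minimal sketch of a Parametric Rectified Linear Unit (PReLU).
# Unlike ReLU (slope 0 for negative inputs) or Leaky ReLU (fixed small slope),
# PReLU learns the negative-side slope `a` along with the other weights.
import numpy as np

def prelu(x, a):
    """f(x) = x for x > 0, a * x otherwise; `a` is a learned parameter."""
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    """Gradient of the output w.r.t. the slope `a`, used to learn it."""
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x, a=0.25))        # [-0.5   -0.125  0.     1.5  ]
```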
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition
TLDR
Experimental results on the Aurora2 database show that bottleneck features extracted by the CBN outperform some conventional speech features as well as robust features extracted by a CNN.
Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network
TLDR
The study reveals that the CNN-based method for identifying tonal speech sentences and adding instrumental knowledge performs better than the existing and conventional approaches.
Multiresolution convolutional neural network for robust speech recognition
TLDR
Recognition accuracy on the Aurora 2 database shows that an MRCNN with two CNNs and corresponding 1×6 and 1×20 convolution filter sizes outperforms plain CNNs and other MRCNN settings in extracting robust features.
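The multiresolution idea in this entry is to run two convolutional branches with different filter widths (1×6 and 1×20) over the same input and fuse their feature maps. The sketch below is one plausible, simplified reading of that front end; the axis the filters run along, the channel counts, the pooling, and the fusion by concatenation are assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch of a multiresolution front end: two parallel convolution
# branches with narrow (1x6) and wide (1x20) filters over the same input,
# concatenated before the classifier. Channel counts, pooling, and the fusion
# strategy are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class MultiResFrontEnd(nn.Module):
    def __init__(self, n_filters=32):
        super().__init__()
        self.narrow = nn.Conv2d(1, n_filters, kernel_size=(1, 6))   # fine resolution
        self.wide = nn.Conv2d(1, n_filters, kernel_size=(1, 20))    # coarse resolution

    def forward(self, x):                                # x: (batch, 1, bands, frames)
        a = torch.relu(self.narrow(x)).amax(dim=(2, 3))  # global max-pool each branch
        b = torch.relu(self.wide(x)).amax(dim=(2, 3))
        return torch.cat([a, b], dim=1)                  # fused multiresolution features

x = torch.randn(4, 1, 40, 30)          # batch of 4 spectrogram patches (40 bands x 30 frames)
print(MultiResFrontEnd()(x).shape)     # torch.Size([4, 64])
```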
Speech recognition based on convolutional neural networks
TLDR
Experimental results show that CNNs can efficiently implement isolated word recognition and are an alternative type of neural network that can reduce spectral variation and model the spectral correlations that exist in signals.
Increasing the robustness of CNN acoustic models using ARMA spectrogram features and channel dropout
TLDR
This work proposes an improved version of input dropout that exploits the special structure of the input time-frequency representation, and replaces the standard mel-spectrogram input representation with the autoregressive moving average (ARMA) spectrogram, which was recently shown to outperform the former under mismatched train-test conditions.
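Channel dropout, as described here, zeroes whole frequency channels of the time-frequency input together instead of dropping individual time-frequency bins. The NumPy sketch below illustrates that idea only; the drop rate and the inverted-dropout rescaling are assumptions, not details taken from the paper.

```python
# Illustrative sketch of channel dropout on a time-frequency input: instead of
# zeroing independent time-frequency bins, whole frequency channels (rows of
# the spectrogram) are dropped together. Drop rate and rescaling are assumptions.
import numpy as np

def channel_dropout(spec, p=0.2, rng=np.random.default_rng()):
    """spec: (n_channels, n_frames) spectrogram; drop each channel with prob. p."""
    keep = rng.random(spec.shape[0]) >= p          # one Bernoulli draw per channel
    mask = keep[:, None].astype(spec.dtype)        # broadcast the mask over time
    return spec * mask / (1.0 - p)                 # rescale kept channels (inverted dropout)

spec = np.random.rand(40, 100)                     # 40 mel channels x 100 frames
out = channel_dropout(spec, p=0.2)
print((out.sum(axis=1) == 0).sum(), "of 40 channels dropped")
```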
...

References

Showing 1-10 of 27 references
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Convolutional Neural Networks for Speech Recognition
TLDR
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
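Limited weight sharing, as opposed to full weight sharing, ties convolution filters only within a local frequency region, so different frequency bands get their own filter sets. The sketch below is a simplified illustration of that idea; the band count, band size, filter shape, and padding are assumptions and do not reproduce the paper's exact scheme.

```python
# Illustrative sketch of limited weight sharing: the frequency axis is split
# into bands and each band gets its own convolution filters, instead of one
# filter set shared across all frequencies (full weight sharing). Band count,
# filter sizes, and channel counts are assumptions.
import torch
import torch.nn as nn

class LimitedWeightSharingConv(nn.Module):
    def __init__(self, n_bands=4, band_size=10, n_filters=16):
        super().__init__()
        self.band_size = band_size
        # one independent filter set per frequency band
        self.band_convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, kernel_size=(3, 3), padding=1) for _ in range(n_bands)
        )

    def forward(self, x):                      # x: (batch, 1, freq, time)
        bands = x.split(self.band_size, dim=2) # split along the frequency axis
        outs = [conv(b) for conv, b in zip(self.band_convs, bands)]
        return torch.cat(outs, dim=2)          # re-assemble along frequency

x = torch.randn(2, 1, 40, 11)                  # 40 mel bands, 11-frame context
print(LimitedWeightSharingConv()(x).shape)     # torch.Size([2, 16, 40, 11])
```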
Phone recognition with hierarchical convolutional deep maxout networks
  • L. Tóth
  • Computer Science
    EURASIP J. Audio Speech Music. Process.
  • 2015
TLDR
It is shown that with the hierarchical modelling approach the CNN can reduce the error rate of a network that sees an expanded input context, and all of the proposed modelling improvements are found to give consistently better results on this larger database as well.
Convolutional Neural Networks for Distant Speech Recognition
TLDR
This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
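One plausible reading of channel-wise convolution with two-way pooling is that the same filters are applied to each distant-microphone channel separately and the resulting feature maps are pooled both within a channel (over frequency) and across microphone channels. The sketch below illustrates that reading under assumed sizes; it is not the configuration reported in the paper.

```python
# Hedged sketch of one reading of channel-wise convolution with cross-channel
# pooling for multi-microphone input: shared filters are applied to each
# microphone channel separately, then feature maps are max-pooled across
# frequency and across microphones. All sizes are assumptions.
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, kernel_size=(8, 1))          # shared filters, frequency-axis convolution
pool_freq = nn.MaxPool2d(kernel_size=(3, 1))         # pooling within each channel (frequency)

x = torch.randn(4, 3, 40, 11)                        # batch, 3 microphones, 40 bands, 11 frames
per_mic = [pool_freq(torch.relu(conv(x[:, m:m + 1]))) for m in range(x.shape[1])]
fused = torch.stack(per_mic, dim=0).amax(dim=0)      # pooling across microphone channels
print(fused.shape)                                   # torch.Size([4, 32, 11, 11])
```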
Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks.
TLDR
This paper argues that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals, and shows that these representations become increasingly insensitive to small perturbations in the input with increasing network depth.
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.
Robust CNN-based speech recognition with Gabor filter kernels
TLDR
A neural network architecture called a Gabor Convolutional Neural Network (GCNN) is proposed that incorporates Gabor functions into convolutional filter kernels and performs better than other noise-robust features that have been tried, namely ETSI-AFE, PNCC, Gabor features without the CNN-based approach, and the best neural network features that do not incorporate Gabor functions.
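A Gabor kernel is a Gaussian envelope modulated by a sinusoid, which makes it a natural localized time-frequency filter to place inside a convolutional layer. The NumPy sketch below only shows how such a kernel can be built and correlated with a spectrogram; the kernel size, widths, and modulation frequencies are arbitrary assumptions, not the GCNN filterbank.

```python
# Illustrative sketch of building a 2-D Gabor function to use as a fixed
# convolutional kernel over a time-frequency input. The specific widths and
# modulation frequencies below are arbitrary assumptions.
import numpy as np

def gabor_kernel(size=11, sigma_t=2.0, sigma_f=2.0, omega_t=0.25, omega_f=0.0):
    """2-D Gabor: a Gaussian envelope modulated by a sinusoid."""
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    envelope = np.exp(-(t**2 / (2 * sigma_t**2) + f**2 / (2 * sigma_f**2)))
    carrier = np.cos(2 * np.pi * (omega_t * t + omega_f * f))
    return envelope * carrier

kernel = gabor_kernel()
spectrogram = np.random.rand(40, 100)                 # 40 bands x 100 frames
# correlate the spectrogram with the Gabor kernel (valid region, loop form)
out = np.array([[np.sum(spectrogram[i:i + 11, j:j + 11] * kernel)
                 for j in range(100 - 10)] for i in range(40 - 10)])
print(kernel.shape, out.shape)                        # (11, 11) (30, 90)
```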
Convolutional deep maxout networks for phone recognition
TLDR
Phone recognition tests on the TIMIT database show that switching to maxout units from rectifier units decreases the phone error rate for each network configuration studied, and yields relative error rate reductions of between 2% and 6%.
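A maxout unit replaces a fixed nonlinearity with the maximum over a small group of linear pieces, so the activation function itself is learned. The NumPy sketch below shows the forward pass for a single maxout layer; the group size and dimensions are arbitrary assumptions.

```python
# Minimal sketch of a maxout unit: each output takes the maximum over a group
# of k linear pieces, giving a learned piecewise-linear activation instead of
# a fixed nonlinearity such as a rectifier. Group size k=3 is an assumption.
import numpy as np

def maxout(x, W, b):
    """x: (d_in,), W: (k, d_out, d_in), b: (k, d_out) -> (d_out,)."""
    pieces = W @ x + b          # k linear projections of the same input
    return pieces.max(axis=0)   # element-wise max across the k pieces

rng = np.random.default_rng(0)
d_in, d_out, k = 8, 4, 3
W = rng.normal(size=(k, d_out, d_in))
b = rng.normal(size=(k, d_out))
print(maxout(rng.normal(size=d_in), W, b))   # 4 maxout activations
```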
Improving language-universal feature extraction with deep maxout and convolutional neural networks
TLDR
Different strategies to further improve LUFEs are explored, including replacing the standard sigmoid nonlinearity with the recently proposed maxout units and applying the convolutional neural network architecture to obtain a more invariant feature space.
On rectified linear units for speech processing
TLDR
This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.
...