Improvements to Deep Convolutional Neural Networks for LVCSR

@article{Sainath2013ImprovementsTD,
  title={Improvements to Deep Convolutional Neural Networks for LVCSR},
  author={Tara N. Sainath and Brian Kingsbury and Abdel-rahman Mohamed and George E. Dahl and George Saon and Hagen Soltau and Tom{\'a}s Beran and Aleksandr Y. Aravkin and Bhuvana Ramabhadran},
  journal={2013 IEEE Workshop on Automatic Speech Recognition and Understanding},
  year={2013},
  pages={315-320}
}
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are better able to reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing relative improvements in word error rate (WER) of 4-12% over DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full…
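The limited-versus-full weight-sharing comparison is easy to illustrate in code. The sketch below is a minimal PyTorch illustration with assumed shapes (40 log-mel bands, an 11-frame context, four frequency sections), not the paper's actual configuration:

```python
import torch
import torch.nn as nn

# Input: (batch, channel, frequency, time), e.g. 40 log-mel bands x 11 frames.
x = torch.randn(8, 1, 40, 11)

# Full weight sharing (FWS): one filter bank convolved across all frequency bands.
fws = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=(9, 9))
y_fws = fws(x)  # (8, 64, 32, 3)

# Limited weight sharing (LWS): split the frequency axis into sections and
# learn a separate filter bank per section (section size is an assumption).
sections = torch.split(x, 10, dim=2)  # four 10-band frequency sections
lws = nn.ModuleList(nn.Conv2d(1, 64, kernel_size=(9, 9)) for _ in sections)
y_lws = [conv(s) for conv, s in zip(lws, sections)]  # each (8, 64, 2, 3)
```

Under full weight sharing the same filters slide over the whole frequency axis; under limited weight sharing each frequency section gets its own filters, reflecting the fact that spectral patterns differ between low and high frequencies.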

Citations

Phone recognition with hierarchical convolutional deep maxout networks
  • L. Tóth
  • Computer Science
    EURASIP J. Audio Speech Music Process.
  • 2015
TLDR
It is shown that with the hierarchical modelling approach, the CNN can reduce the error rate of the network by operating on an expanded context of input, and it is found that all the proposed modelling improvements give consistently better results on this larger database as well.
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
Improvements to speaker adaptive training of deep neural networks
TLDR
Different methods to further improve and extend SAT-DNN are presented, covering tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling, and multilingual DNN-based feature extraction.
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
TLDR
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
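As a rough illustration of such a combined architecture, the sketch below stacks a frequency-convolution front end, LSTM layers, and fully connected layers. All layer sizes are assumptions for the sketch, not the CLDNN paper's configuration:

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_states=2000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(8, 1)),  # filter along frequency
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),      # pool in frequency only
        )
        conv_dim = 32 * ((n_mels - 8 + 1) // 3)    # channels x pooled freq bins
        self.lstm = nn.LSTM(conv_dim, hidden, num_layers=2, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(hidden, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),             # CD-state posteriors
        )

    def forward(self, x):                          # x: (batch, time, n_mels)
        h = self.conv(x.transpose(1, 2).unsqueeze(1))  # (b, 32, f', t)
        h = h.flatten(1, 2).transpose(1, 2)            # (b, t, 32 * f')
        h, _ = self.lstm(h)
        return self.dnn(h)

logits = CLDNN()(torch.randn(4, 100, 40))  # (4, 100, 2000)
```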
Improving language-universal feature extraction with deep maxout and convolutional neural networks
TLDR
Different strategies to further improve LUFEs are explored, including replacing the standard sigmoid nonlinearity with the recently proposed maxout units and applying the convolutional neural network architecture to obtain more invariant feature space.
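A maxout unit can be sketched in a few lines: each output takes the maximum over k linear "pieces", replacing the sigmoid nonlinearity mentioned above. The layer sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    def __init__(self, d_in, d_out, k=2):
        super().__init__()
        self.k = k
        self.linear = nn.Linear(d_in, d_out * k)  # k linear pieces per unit

    def forward(self, x):
        z = self.linear(x)                        # (..., d_out * k)
        z = z.view(*x.shape[:-1], -1, self.k)     # (..., d_out, k)
        return z.max(dim=-1).values               # max over the k pieces

h = Maxout(440, 1024, k=2)(torch.randn(8, 440))  # (8, 1024)
```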
Incorporating Context Information into Deep Neural Network Acoustic Models
TLDR
This thesis proposes a framework to build cross-language DNNs via language-universal feature extractors (LUFEs), and presents a novel framework to perform feature-space SAT for DNN models, which can be naturally extended to other deep learning models such as CNNs.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
TLDR
This paper proposes an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections, and argues that CNNs have the capability to model temporal correlations with appropriate context information.
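The CTC objective that such models train against can be invoked directly from a framework. The sketch below uses PyTorch's nn.CTCLoss with assumed dimensions (100 frames, 41 output labels including blank, 20-label targets), not the paper's setup:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
log_probs = torch.randn(100, 4, 41).log_softmax(-1)  # (time, batch, labels)
targets = torch.randint(1, 41, (4, 20))              # phone label sequences
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```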
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under small-sizing constraints for the purpose of deploying small-footprint models to devices.
Convolutional neural networks for acoustic modeling of raw time signal in LVCSR
TLDR
It is shown that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on the raw time signal can be strongly reduced by introducing 1D-convolutional layers.
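A 1D-convolutional front end on the raw signal can be sketched as follows; the kernel and stride values (roughly 16 ms windows every 10 ms at 16 kHz) are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

raw = torch.randn(8, 1, 16000)  # one second of 16 kHz audio
frontend = nn.Sequential(
    # First layer spans ~16 ms (256 samples) and strides by 10 ms (160
    # samples), mimicking the frame rate of hand-crafted features.
    nn.Conv1d(1, 128, kernel_size=256, stride=160),
    nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=5, padding=2),
    nn.ReLU(),
)
feats = frontend(raw)  # (8, 128, 99): learned "feature frames"
```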

References

Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning, on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
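The recipe itself is compact. A minimal sketch with assumed layer sizes, not the paper's Broadcast News configuration:

```python
import torch.nn as nn

dnn = nn.Sequential(
    nn.Linear(360, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 2000),  # context-dependent state posteriors
)
dnn.train()  # dropout active during training; dnn.eval() disables it
```

PyTorch's inverted dropout rescales surviving activations by 1/(1-p) during training, so no extra scaling is needed at test time.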
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of the hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while…
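The core idea, different pooling sizes in different frequency bands, can be sketched as follows; the band split and pooling sizes are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, 24, 11)            # (batch, channels, frequency, time)
low, high = x[:, :, :12], x[:, :, 12:]    # two frequency bands (assumed split)
y = torch.cat([
    F.max_pool2d(low,  kernel_size=(2, 1)),  # small pooling in low bands
    F.max_pool2d(high, kernel_size=(4, 1)),  # larger pooling in high bands
], dim=2)                                    # (8, 64, 9, 11)
```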
Convolutional neural networks applied to house numbers digit classification
TLDR
This work augmented the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling and establishes a new state-of-the-art of 95.10% accuracy on the SVHN dataset (48% error improvement).
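Lp pooling computes (mean |x|^p)^(1/p) over each region, interpolating between average pooling (p = 1) and max pooling (p → ∞). A minimal sketch, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def lp_pool2d(x, p=2.0, kernel=2):
    # (mean of |x|^p over each kernel x kernel region) ^ (1/p)
    return F.avg_pool2d(x.abs().pow(p), kernel).pow(1.0 / p)

y = lp_pool2d(torch.randn(8, 32, 24, 24), p=2.0)  # (8, 32, 12, 12)
```

PyTorch also ships nn.LPPool2d, which uses a sum rather than a mean over each region.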
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly…
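The procedure is simple to sketch: within each pooling region, sample one activation with probability proportional to its value. The helper below is a hypothetical illustration assuming non-negative (post-ReLU) activations:

```python
import torch

def stochastic_pool(region):
    # Sample one activation with probability proportional to its value,
    # instead of deterministically taking the max or the mean.
    probs = region / region.sum()
    idx = torch.multinomial(probs, num_samples=1)
    return region[idx]

activations = torch.rand(4)            # one 2x2 pooling region, post-ReLU
pooled = stochastic_pool(activations)
```

At test time the method replaces sampling with its expectation, weighting each activation by its probability.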
Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization
TLDR
A distributed neural network training algorithm, based on Hessian-free optimization, that scales to deep networks and large data sets and yields relative reductions in word error rate of 7-13% over cross-entropy training with stochastic gradient descent on two larger tasks: Switchboard and DARPA RATS noisy Levantine Arabic.