Improvements to Deep Convolutional Neural Networks for LVCSR

@article{Sainath2013ImprovementsTD,
  title={Improvements to Deep Convolutional Neural Networks for LVCSR},
  author={Tara N. Sainath and Brian Kingsbury and Abdel-rahman Mohamed and George E. Dahl and George Saon and Hagen Soltau and Tom{\'a}s Beran and Aleksandr Y. Aravkin and Bhuvana Ramabhadran},
  journal={2013 IEEE Workshop on Automatic Speech Recognition and Understanding},
  year={2013},
  pages={315-320}
}
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are better able to reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing relative improvements in word error rate (WER) of 4-12% over DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full…
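
To make the abstract's distinction concrete, here is a minimal PyTorch sketch contrasting the two weight-sharing schemes (assuming 40 log-mel channels and 11 context frames; filter counts and kernel sizes are illustrative, not the paper's settings): full weight sharing convolves one filter set across the whole frequency axis, while limited weight sharing gives each frequency band its own filters.

import torch
import torch.nn as nn

x = torch.randn(8, 1, 11, 40)  # (batch, channel, time, frequency)

# Full weight sharing: one filter set is convolved across all frequencies.
fws = nn.Conv2d(1, 64, kernel_size=(9, 9))
y_fws = fws(x)

# Limited weight sharing: split the frequency axis into bands and give each
# band its own filters; weights are shared only within a band.
class LimitedWeightSharing(nn.Module):
    def __init__(self, bands=4, out_channels=16):
        super().__init__()
        self.bands = bands
        self.convs = nn.ModuleList(
            nn.Conv2d(1, out_channels, kernel_size=(9, 5)) for _ in range(bands)
        )

    def forward(self, x):
        chunks = torch.chunk(x, self.bands, dim=3)  # split frequency axis
        return torch.cat([c(z) for c, z in zip(self.convs, chunks)], dim=1)

y_lws = LimitedWeightSharing()(x)  # (8, 64, 3, 6): 4 bands x 16 filters
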

Citations

A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
TLDR
A hybrid CNN-BLSTM architecture is proposed to exploit the spatial and temporal properties of the speech signal, improve continuous speech recognition, and overcome another shortcoming of CNNs: speaker-adapted features cannot be modeled directly in a CNN.
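
As a rough illustration of the hybrid above, the following is a minimal PyTorch sketch (all sizes, including the 42-class output, are assumptions rather than the cited configuration): convolution plus frequency-only pooling reduces spectral variation, and a bidirectional LSTM then models the temporal structure.

import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    def __init__(self, n_mels=40, n_classes=42):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # pool in frequency only
        )
        self.blstm = nn.LSTM(32 * (n_mels // 2), 256,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 256, n_classes)

    def forward(self, x):                     # x: (batch, 1, time, n_mels)
        h = self.conv(x)                      # (batch, 32, time, n_mels // 2)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, feature)
        h, _ = self.blstm(h)
        return self.out(h)                    # per-frame class scores

scores = CNNBLSTM()(torch.randn(4, 1, 100, 40))  # (4, 100, 42)
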
Phone recognition with hierarchical convolutional deep maxout networks
  • L. Tóth
  • Computer Science
    EURASIP J. Audio Speech Music Process.
  • 2015
TLDR
It is shown that the hierarchical modelling approach lets the CNN reduce the error rate by operating on an expanded input context, and that all of the proposed modelling improvements give consistently better results on the larger database as well.
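
The maxout nonlinearity used above is easy to sketch. Below is a generic maxout layer in PyTorch (the group size of 2 is an assumption, and this is a single layer, not the paper's hierarchical configuration): each output is the maximum over a small group of linear pieces, giving a learned piecewise-linear activation.

import torch
import torch.nn as nn

class Maxout(nn.Module):
    def __init__(self, in_features, out_features, pieces=2):
        super().__init__()
        self.pieces = pieces
        self.linear = nn.Linear(in_features, out_features * pieces)

    def forward(self, x):
        z = self.linear(x)                          # (batch, out * pieces)
        z = z.view(*x.shape[:-1], -1, self.pieces)  # (batch, out, pieces)
        return z.max(dim=-1).values                 # max over each group

y = Maxout(440, 1024)(torch.randn(8, 440))  # (8, 1024)
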
Improvements to speaker adaptive training of deep neural networks
TLDR
Different methods to further improve and extend SAT-DNN are presented, for tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling, and multilingual DNN-based feature extraction.
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
TLDR
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
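
A minimal sketch of the CLDNN stacking order described above (layer sizes are illustrative assumptions, not the published configuration): convolutional layers first, then LSTMs, then fully connected layers.

import torch
import torch.nn as nn

class CLDNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=42):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(1, 9)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),   # pool in frequency
        )
        freq_out = (n_mels - 9 + 1) // 3        # frequency size after conv+pool
        self.lstm = nn.LSTM(64 * freq_out, 256, num_layers=2, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, n_classes),
        )

    def forward(self, x):                       # x: (batch, 1, time, n_mels)
        h = self.conv(x)                        # (batch, 64, time, freq_out)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (batch, time, 64 * freq_out)
        h, _ = self.lstm(h)
        return self.dnn(h)                      # per-frame class scores

scores = CLDNN()(torch.randn(4, 1, 100, 40))   # (4, 100, 42)
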
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
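
The "no fully-connected layers" idea above can be sketched by ending the network with a 1x1 convolution over the class dimension (depth, widths, and the 42-class output are assumptions):

import torch
import torch.nn as nn

all_conv = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),            # pool in frequency only
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 42, kernel_size=1),           # 1x1 conv replaces FC layers
)

x = torch.randn(4, 1, 100, 40)                   # (batch, 1, time, n_mels)
y = all_conv(x).mean(dim=3)                      # average out frequency
print(y.shape)                                   # (4, 42, 100): per-frame scores
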
Improving language-universal feature extraction with deep maxout and convolutional neural networks
TLDR
Different strategies to further improve LUFEs are explored, including replacing the standard sigmoid nonlinearity with the recently proposed maxout units and applying the convolutional neural network architecture to obtain a more invariant feature space.
Incorporating Context Information into Deep Neural Network Acoustic Models
TLDR
This thesis proposes a framework to build cross-language DNNs via language-universal feature extractors (LUFEs), and presents a novel framework to perform feature-space SAT for DNN models, which can be naturally extended to other deep learning models such as CNNs.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
TLDR
This paper proposes an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections, and argues that CNNs have the capability to model temporal correlations with appropriate context information.
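
The CTC objective used above marginalizes over all alignments between per-frame outputs and the label sequence, so no recurrent layers are strictly required. A minimal PyTorch sketch with a random stand-in for the CNN's per-frame outputs (all shapes and the 42-class inventory are assumptions):

import torch
import torch.nn as nn

T, B, C = 100, 4, 42                    # frames, batch, classes (0 = blank)
acts = torch.randn(T, B, C, requires_grad=True)  # stand-in for CNN outputs
log_probs = acts.log_softmax(dim=2)

targets = torch.randint(1, C, (B, 20))           # toy label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)               # sums over all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                         # gradients flow back into the network
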

References

Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities and minimal human hyper-parameter tuning on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
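
The recipe above amounts to swapping sigmoid units for ReLUs and regularizing with dropout. A minimal PyTorch sketch (the 50% dropout rate and layer sizes are assumptions, not the paper's exact setup):

import torch
import torch.nn as nn

dnn = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 42),
)

dnn.train()                     # dropout active during training
y = dnn(torch.randn(8, 440))
dnn.eval()                      # dropout disabled at test time
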
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
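
The normalization effect described above can be demonstrated directly: after local filtering and max-pooling along frequency, a small spectral shift (e.g. a speaker-dependent formant position) changes the pooled output only slightly. A toy PyTorch check (sizes are illustrative):

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, kernel_size=(1, 8))   # local filters along frequency
pool = nn.MaxPool2d(kernel_size=(1, 4))       # max-pool along frequency only

x = torch.randn(1, 1, 11, 40)                 # (batch, 1, time, mel bands)
x_shift = torch.roll(x, shifts=1, dims=3)     # spectrum shifted by one band

h, h_shift = pool(conv(x)), pool(conv(x_shift))
print((h - h_shift).abs().mean())             # small: pooling absorbs the shift
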
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while…
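
A sketch of the heterogeneous-pooling idea, assuming (hypothetically) three frequency bands with pooling sizes 1, 2, and 4: larger pools buy more shift invariance at the cost of phonetic discrimination, so the trade-off can differ per band.

import torch
import torch.nn.functional as F

def heterogeneous_pool(h, band_pools=((10, 1), (10, 2), (12, 4))):
    """h: (batch, ch, time, freq); each (width, pool) pools one frequency band."""
    outs, start = [], 0
    for width, p in band_pools:
        band = h[..., start:start + width]
        outs.append(F.max_pool2d(band, kernel_size=(1, p)))
        start += width
    return torch.cat(outs, dim=3)

h = torch.randn(4, 16, 11, 32)
print(heterogeneous_pool(h).shape)  # (4, 16, 11, 18) = 10 + 5 + 3 bands
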
Auto-encoder bottleneck features using deep belief networks
TLDR
The experiments indicate that with the AE-BN architecture, pre-trained and deeper NNs produce better AE-BN features, and system combination with the GMM/HMM baseline and AE-BN systems provides an additional 0.5% absolute improvement on a larger Broadcast News task.
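
Generically, a bottleneck feature extractor trains an auto-encoder to reconstruct its input and then reads features off the narrow middle layer. A minimal sketch of that idea (sizes are assumptions; the cited work's pre-training and input choices are not reproduced here):

import torch
import torch.nn as nn

class AEBN(nn.Module):
    def __init__(self, dim=440, bottleneck=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bottleneck),          # narrow bottleneck layer
        )
        self.decoder = nn.Sequential(
            nn.Sigmoid(), nn.Linear(bottleneck, 1024), nn.Sigmoid(),
            nn.Linear(1024, dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))      # reconstruction for training

model = AEBN()
x = torch.randn(8, 440)
recon_loss = nn.functional.mse_loss(model(x), x)  # auto-encoder objective
features = model.encoder(x)                       # 40-dim bottleneck features
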
Convolutional neural networks applied to house numbers digit classification
TLDR
This work augments the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling, and establishes a new state of the art of 95.10% accuracy on the SVHN dataset (a 48% error improvement).
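
Lp pooling reduces each pooling region to (mean |x|^p)^(1/p), interpolating between average pooling (p = 1) and max pooling (p -> infinity). A minimal sketch (window size and p are arbitrary choices):

import torch
import torch.nn.functional as F

def lp_pool2d(x, p=2.0, kernel=(2, 2)):
    # average-pool |x|^p, then take the p-th root
    return F.avg_pool2d(x.abs().pow(p), kernel).pow(1.0 / p)

x = torch.randn(1, 8, 16, 16)
print(lp_pool2d(x).shape)  # (1, 8, 8, 8)
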
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly…
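
The stochastic procedure samples one activation per pooling window with probability proportional to its magnitude (activations are assumed non-negative, e.g. post-ReLU). A plain-Python sketch over non-overlapping 2x2 windows:

import numpy as np

def stochastic_pool(x, rng=None):
    """x: (H, W) non-negative activations; pools non-overlapping 2x2 windows."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = x.shape
    out = np.empty((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = x[i:i + 2, j:j + 2].ravel()
            p = win / win.sum() if win.sum() > 0 else np.full(4, 0.25)
            out[i // 2, j // 2] = rng.choice(win, p=p)  # sample by magnitude
    return out

a = np.maximum(np.random.default_rng(1).normal(size=(4, 4)), 0)  # ReLU-like
print(stochastic_pool(a))
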