Advances in Very Deep Convolutional Neural Networks for LVCSR

@inproceedings{Sercu2016AdvancesIV,
  title={Advances in Very Deep Convolutional Neural Networks for LVCSR},
  author={Tom Sercu and Vaibhava Goel},
  booktitle={INTERSPEECH},
  year={2016}
}
Very deep CNNs with small 3x3 kernels have recently been shown to achieve very strong performance as acoustic models in hybrid NN-HMM speech recognition systems. In this paper we investigate how to efficiently scale these models to larger datasets. Specifically, we address the design choice of pooling and padding along the time dimension which renders convolutional evaluation of sequences highly inefficient. We propose a new CNN design without timepadding and without timepooling, which is… 
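As a concrete illustration of the design sketched in the abstract, the snippet below is a minimal sketch in PyTorch, not the authors' code; the [batch, channels, time, frequency] tensor layout and all layer sizes are assumptions for illustration. It builds a 3x3 convolution block that pads and pools only along the frequency axis, so each layer shrinks the time axis instead of padding or pooling it:

import torch
import torch.nn as nn

class NoTimePadBlock(nn.Module):
    """3x3 conv block that pads and pools in frequency only (assumed layout: [batch, channels, time, freq])."""
    def __init__(self, in_ch, out_ch, pool_freq=True):
        super().__init__()
        # padding=(0, 1): no padding along time, 1-bin padding along frequency
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=(0, 1))
        self.relu = nn.ReLU()
        # pooling with kernel (1, 2): keep time resolution, halve frequency resolution
        self.pool = nn.MaxPool2d(kernel_size=(1, 2)) if pool_freq else nn.Identity()

    def forward(self, x):
        return self.pool(self.relu(self.conv(x)))

# Without time padding, every 3x3 conv removes 2 frames from the time axis,
# so the input window must carry enough context frames on each side.
x = torch.randn(1, 1, 48, 40)   # 48 frames, 40 mel bins (illustrative sizes)
y = NoTimePadBlock(1, 64)(x)
print(y.shape)                  # torch.Size([1, 64, 46, 20])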

Citations

A Comprehensive Study of Residual CNNs for Acoustic Modeling in ASR
TLDR
The speed of CNNs in training and decoding is significantly increased, approaching the speed of the offline LSTM, and the overfitting problem in training is addressed with data augmentation, namely SpecAugment.
Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
TLDR
This paper introduces an asymmetric dilated convolution that allows for efficient and elegant implementation of pooling in time and shows results using time-dilated convolutions in a very deep VGG-style CNN with batch normalization on the Hub5 Switchboard-2000 benchmark task.
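The entry above describes pooling in time via dilation; a minimal sketch of such an asymmetric time-dilated convolution follows (again plain PyTorch under assumed sizes and a [batch, channels, time, frequency] layout, not the paper's implementation):

import torch
import torch.nn as nn

# dilation=(2, 1): dilate along time only, stay dense along frequency,
# standing in for pooling in time without discarding frames
time_dilated = nn.Conv2d(in_channels=64, out_channels=64,
                         kernel_size=3, dilation=(2, 1), padding=(2, 1))

x = torch.randn(1, 64, 100, 20)   # illustrative: 100 frames, 20 frequency bands
print(time_dilated(x).shape)      # torch.Size([1, 64, 100, 20]): time resolution preserved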
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
Residual Convolutional CTC Networks for Automatic Speech Recognition
TLDR
Experimental results show that the proposed single system RCNN-CTC can achieve the lowest word error rate (WER) on WSJ and Tencent Chat data sets, compared to several widely used neural network systems in ASR.
Ensembles of Multi-Scale VGG Acoustic Models
TLDR
The results show that it is possible to bundle the individual recognition strengths of the VGGs into a much simpler CNN architecture that yields performance equal to the best late combination, and into a large multi-scale system by means of system combination.
Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition
TLDR
The outstanding performance of the novel deep recurrent convolutional neural network applied with deep residual learning indicates that it can potentially be adopted for other sequential problems.
Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
TLDR
It is found that deep CNN embeddings outperform DNN embeddings for acoustic model adaptation, and auxiliary features based on deep CNN embeddings result in word error rates similar to i-vectors.
Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling
TLDR
BMUF is recommended as the top choice for training acoustic models, since it is the most stable, scales well with the number of GPUs, can achieve reproducible results, and in many cases even outperforms single-GPU SGD.
Very deep convolutional networks for end-to-end speech recognition
TLDR
This work successively trains very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models, and applies network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures.

References

Showing 1-10 of 35 references
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.