Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention

@inproceedings{Yu2016DeepCN,
  title={Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention},
  author={Dong Yu and Wayne Xiong and Jasha Droppo and Andreas Stolcke and Guoli Ye and Jinyu Li and Geoffrey Zweig},
  booktitle={INTERSPEECH},
  year={2016}
}
In this paper, we propose a deep convolutional neural network (CNN) with layer-wise context expansion and location-based attention for large vocabulary speech recognition. [...] For this reason, contrary to other CNNs, no pooling operation is used in our model. Experiments on the 309-hour Switchboard task and the 375-hour short message dictation task indicate that our model significantly outperforms both the DNN and the LSTM.
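
To make the abstract's two ingredients concrete, below is a minimal sketch, not the authors' implementation: layer-wise context expansion is approximated here with dilated 1-D convolutions over time, so each layer sees a wider temporal context than the one below it without any pooling, and location-based attention weights the frames of a fixed context window by position alone rather than by content. All layer widths, the dilation schedule, the context size, and the class count are illustrative assumptions.

```python
# Hypothetical sketch of layer-wise context expansion and location-based
# attention, as described in the abstract. Not the paper's actual model.
import torch
import torch.nn as nn

class ContextExpandingCNN(nn.Module):
    """Stack of conv layers whose time context grows per layer; no pooling."""
    def __init__(self, feat_dim=40, channels=128, n_layers=4, n_classes=9000):
        super().__init__()
        layers, in_ch = [], feat_dim
        for i in range(n_layers):
            # Dilation doubles per layer, so the receptive field over time
            # expands layer by layer without any pooling operation.
            layers.append(nn.Conv1d(in_ch, channels, kernel_size=3,
                                    dilation=2 ** i, padding=2 ** i))
            layers.append(nn.ReLU())
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.out = nn.Linear(channels, n_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim); convolve along the time axis.
        h = self.convs(x.transpose(1, 2))      # (batch, channels, time)
        return self.out(h.transpose(1, 2))     # per-frame class scores

class LocationBasedAttention(nn.Module):
    """Weights each frame in a context window by its position only."""
    def __init__(self, context=11):
        super().__init__()
        # One learnable score per frame offset in the window.
        self.scores = nn.Parameter(torch.zeros(context))

    def forward(self, window):
        # window: (batch, context, feat_dim); weights depend on location,
        # not on the frame contents.
        w = torch.softmax(self.scores, dim=0)             # (context,)
        return (window * w.view(1, -1, 1)).sum(dim=1)     # (batch, feat_dim)

if __name__ == "__main__":
    x = torch.randn(2, 100, 40)                   # (batch, time, feat)
    print(ContextExpandingCNN()(x).shape)         # torch.Size([2, 100, 9000])
    win = torch.randn(2, 11, 40)                  # an 11-frame context window
    print(LocationBasedAttention()(win).shape)    # torch.Size([2, 40])
```

Note that the paper itself expands context through layers operating on different time-frequency windows; dilation is used above only as a compact stand-in for the same no-pooling, growing-context idea.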

Citations

A Comprehensive Study of Residual CNNs for Acoustic Modeling in ASR
TLDR
The speed of the CNN in training and decoding is significantly increased, approaching that of the offline LSTM, and the overfitting problem in training is addressed with data augmentation, namely SpecAugment.
Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
TLDR
It is found that deep CNN embeddings outperform DNN embeddings for acoustic model adaptation, and that auxiliary features based on deep CNN embeddings result in word error rates similar to those of i-vectors.
Densely Connected Convolutional Networks for Speech Recognition
TLDR
Experimental results show that DenseNet can be used for acoustic modeling, significantly outperforming other neural models such as DNNs, CNNs, and VGG networks.
Recent progresses in deep learning based acoustic models
  • Dong Yu, Jinyu Li
  • Computer Science
    IEEE/CAA Journal of Automatica Sinica
  • 2017
TLDR
This paper describes models that are optimized end-to-end and emphasize feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model.
Recent advances in convolutional neural networks
Environment-Dependent Attention-Driven Recurrent Convolutional Neural Network for Robust Speech Enhancement
TLDR
The proposed end-to-end environment-dependent attention-driven approach integrates an attention mechanism into a bidirectional long short-term memory network to acquire the weighted dynamic context between consecutive frames, and outperforms existing methods on the REVERB challenge.
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
Residual Convolutional CTC Networks for Automatic Speech Recognition
TLDR
Experimental results show that the proposed single system RCNN-CTC can achieve the lowest word error rate (WER) on WSJ and Tencent Chat data sets, compared to several widely used neural network systems in ASR.
Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks
TLDR
A novel model architecture that incorporates the attention mechanism and gated convolutional network (GCN) into the previously developed permutation invariant training based multi-talker speech recognition system (PIT-ASR) is proposed.

References

Showing 1-10 of 33 references.
Deep convolutional neural networks for acoustic modeling in low resource languages
  • William Chan, I. Lane
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
A detailed empirical study of CNNs under the low-resource condition is performed, finding that a two-dimensional convolutional structure performs best and emphasizing the importance of considering both time and spectrum in modeling acoustic patterns.
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
TLDR
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
Exploring convolutional neural network structures and optimization techniques for speech recognition
TLDR
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the pooling size can be learned automatically.
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
Modeling long temporal contexts in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. Here, we investigate whether the time span of this input can be extended [...]
Very deep convolutional neural networks for LVCSR
TLDR
Very deep CNNs are introduced for the LVCSR task by extending the depth of convolutional layers up to ten, and a better way to perform convolution operations along the temporal dimension is proposed.
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Highway long short-term memory RNNs for distant speech recognition
TLDR
This paper extends deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers, and introduces latency-controlled bidirectional LSTMs (BLSTMs), which can exploit the whole history while keeping the latency under control.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be automatically learned, and it is established that the CNN structure combined with maxout units is the most effective model under small-size constraints for the purpose of deploying small-footprint models to devices.
Far-field speech recognition using CNN-DNN-HMM with convolution in time
TLDR
Experimental results show that a CNN coupled with a fully connected DNN can model short time correlations in feature vectors with fewer parameters than a DNN and thus generalise better to unseen test environments.