Simplifying very deep convolutional neural network architectures for robust speech recognition

@article{Rownicka2017SimplifyingVD,
  title={Simplifying very deep convolutional neural network architectures for robust speech recognition},
  author={Joanna Rownicka and Steve Renals and Peter Bell},
  journal={2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2017},
  pages={236--243}
}
Very deep convolutional neural networks (VDCNNs) have been successfully used in computer vision. More recently, VDCNNs have been applied to speech recognition, using architectures adopted from computer vision. In this paper, we experimentally analyse the role of the components in VDCNN architectures for robust speech recognition. We propose a number of simplified VDCNN architectures, taking into account the use of fully-connected layers and down-sampling approaches. We investigate… 
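As a rough illustration of the "no fully-connected layers" simplification the abstract alludes to, the NumPy sketch below replaces the usual FC classifier head with a 1×1 convolution followed by global average pooling. This is not the paper's exact architecture: the input shape, layer counts, and the 42-class output head are all hypothetical.

```python
import numpy as np

def conv2d_valid(x, w, relu=True):
    """Naive 'valid' 2-D convolution: x is (H, W, Cin), w is (kH, kW, Cin, Cout)."""
    kH, kW, Cin, Cout = w.shape
    H, W, _ = x.shape
    out = np.zeros((H - kH + 1, W - kW + 1, Cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kH, j:j + kW, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0) if relu else out

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 11, 1))          # e.g. 40 mel bins x 11-frame context window

w1 = 0.1 * rng.normal(size=(3, 3, 1, 8))  # two small 3x3 conv layers (hypothetical sizes)
w2 = 0.1 * rng.normal(size=(3, 3, 8, 8))
h = conv2d_valid(conv2d_valid(x, w1), w2)

# "all-conv" head: a 1x1 convolution to the class dimension followed by
# global average pooling, in place of fully-connected layers
w_cls = 0.1 * rng.normal(size=(1, 1, 8, 42))   # 42 hypothetical output classes
logits = conv2d_valid(h, w_cls, relu=False).mean(axis=(0, 1))
print(logits.shape)                        # (42,)
```

The head stays convolutional, so the number of parameters no longer depends on the spatial size of the feature map, which is one practical consequence of dropping FC layers.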

Citations

Evaluation of Modified Deep Neural Network Architecture Performance for Speech Recognition
TLDR
Four different Deep Neural Network (DNN) architectures are proposed and compared in terms of accuracy and training time; the modified triangular architecture gave the highest accuracy of the four.
Multi-Scale Octave Convolutions for Robust Speech Recognition
  • Joanna Rownicka, P. Bell, S. Renals
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
It is argued that octave convolutions likewise improve the robustness of learned representations due to the use of average pooling in the lower resolution group, acting as a low-pass filter, while improving the computational efficiency of the CNN acoustic models.
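The low-pass interpretation of average pooling mentioned in this summary can be checked with a toy NumPy sketch. The signal frequencies, sample rate, and pooling window below are arbitrary illustrations, not values from the paper:

```python
import numpy as np

# toy 1-D signal at fs = 256 Hz: a 3 Hz component plus a 60 Hz component
t = np.linspace(0.0, 1.0, 256, endpoint=False)
low = np.sin(2 * np.pi * 3 * t)
high = 0.5 * np.sin(2 * np.pi * 60 * t)
signal = low + high

# average pooling with window 8 / stride 8 -- the down-sampling applied to
# the low-resolution group; it is a moving-average, i.e. a crude low-pass filter
pool = lambda x: x.reshape(-1, 8).mean(axis=1)

# how much of the 60 Hz component survives pooling (pooling is linear,
# so pool(signal) - pool(low) equals pool(high))
survived = pool(signal) - pool(low)
print(np.abs(high).max())       # 0.5 before pooling
print(np.abs(survived).max())   # far smaller after pooling
```

The high-frequency component is strongly attenuated while the low-frequency one passes almost unchanged, which is the sense in which the averaged low-resolution branch discards fine detail that noise tends to corrupt.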
Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
TLDR
It is found that deep CNN embeddings outperform DNN embeddings for acoustic model adaptation, and that auxiliary features based on deep CNN embeddings result in word error rates similar to i-vectors.
Embeddings for DNN Speaker Adaptive Training
TLDR
The performance of a given representation for speaker recognition is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than just speaker identity was the most important characteristic of the embeddings for efficient DNN-SAT ASR.
Novel Demodulation-Based Features using Classifier-level Fusion of GMM and CNN for Replay Detection
TLDR
The architecture in which max-pooling is replaced by a convolutional layer, together with FC layers, performed relatively better on most of the AM-FM feature sets compared to other CNNs, and the ESA-based AM features performed better because AM fluctuates less than FM during model training.
Automatic Database Segmentation using Hybrid Spectrum -Visual Approach
  • Manar Gbaily
  • Computer Science
    The Egyptian Journal of Language Engineering
  • 2021
TLDR
A novel method for segmenting speech phonemes, in which the proposed strategy helps select an appropriate feature extraction technique for speech segmentation; it has the potential to be used in applications such as automatic speech recognition and automatic language identification.
An Art of Speech Recognition: A Review
TLDR
This paper provides a literature review of the various feature extraction and classification methods used in speech recognition systems.
Design of Countermeasures for Replay Spoof Speech Attack
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.

References

SHOWING 1-10 OF 30 REFERENCES
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition
TLDR
The proposed very deep CNNs can significantly reduce word error rate (WER) for noise robust speech recognition and are competitive with the long short-term memory recurrent neural networks (LSTM-RNN) acoustic model.
Very deep convolutional neural networks for robust speech recognition
  • Y. Qian, P. Woodland
  • Computer Science
    2016 IEEE Spoken Language Technology Workshop (SLT)
  • 2016
TLDR
The extension and optimisation of previous work on very deep convolutional neural networks for effective recognition of noisy speech on the Aurora 4 task are described, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective.
Convolutional Neural Networks for Speech Recognition
TLDR
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Advances in Very Deep Convolutional Neural Networks for LVCSR
TLDR
This paper proposes a new CNN design without time padding and without time pooling, which is slightly suboptimal for accuracy but has two significant advantages: it enables sequence training and deployment by allowing efficient convolutional evaluation of full utterances, and it allows batch normalization to be straightforwardly applied to CNNs on sequence data.
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under small-size constraints for deploying small-footprint models to devices.
Convolutional Neural Networks for Distant Speech Recognition
TLDR
This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.