Convolutional Neural Networks for Distant Speech Recognition

@article{Swietojanski2014ConvolutionalNN,
  title={Convolutional Neural Networks for Distant Speech Recognition},
  author={Pawel Swietojanski and Arnab Ghoshal and Steve Renals},
  journal={IEEE Signal Processing Letters},
  year={2014},
  volume={21},
  pages={1120--1124}
}
We investigate convolutional neural networks (CNNs) for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). In the MDM case we explore a beamformed signal input representation compared with the direct use of multiple acoustic channels as a parallel input to the CNN. We have explored different weight sharing approaches, and propose a channel-wise convolution with two-way pooling. Our experiments… 
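The letter itself defines the channel-wise convolution and two-way pooling precisely; purely as orientation, the numpy sketch below shows one plausible reading, in which shared filters are convolved along the frequency axis within each microphone channel and the outputs are then max-pooled both across channels and across frequency bands. The shapes, the ReLU nonlinearity, and the pooling order are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def channelwise_conv_two_way_pool(x, w, b, freq_pool=3):
    """Illustrative channel-wise convolution with two-way pooling.

    x : (n_channels, n_bands) multi-channel filterbank features, one frame
    w : (n_filters, filt_width) 1-D frequency filters shared by all channels
    b : (n_filters,) filter biases
    """
    n_ch, n_bands = x.shape
    n_filt, width = w.shape
    n_out = n_bands - width + 1
    # Channel-wise convolution: the same filters slide along the frequency
    # axis of every microphone channel independently.
    conv = np.empty((n_ch, n_filt, n_out))
    for c in range(n_ch):
        for f in range(n_filt):
            for i in range(n_out):
                conv[c, f, i] = np.dot(w[f], x[c, i:i + width]) + b[f]
    conv = np.maximum(conv, 0.0)  # ReLU nonlinearity (an assumption here)
    # Two-way pooling: max over channels first, then max over small
    # non-overlapping frequency regions.
    pooled = conv.max(axis=0)                       # (n_filt, n_out)
    n_reg = n_out // freq_pool
    pooled = pooled[:, :n_reg * freq_pool].reshape(n_filt, n_reg, freq_pool)
    return pooled.max(axis=2)  # (n_filters, n_regions), fed to upper layers

# Toy usage: 4 microphone channels, 40 mel bands, 8 filters of width 5.
out = channelwise_conv_two_way_pool(np.random.randn(4, 40),
                                    0.1 * np.random.randn(8, 5),
                                    np.zeros(8))
print(out.shape)  # (8, 12)
```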

Citations

Convolutional Neural Network and Feature Transformation for Distant Speech Recognition
TLDR
It is argued that transforming the input features can produce more discriminative features for the CNN, and hence improve the robustness of speech recognition against reverberation.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under model-size constraints for deploying small-footprint models on devices.
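For readers unfamiliar with the maxout units mentioned above, here is a minimal numpy sketch of the standard maxout activation (the max over k affine "pieces" per output unit); the piece count and dimensions are illustrative.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the activation is the max over k affine 'pieces'.

    x : (d_in,) input vector
    W : (k, d_out, d_in) weights for the k linear pieces per output unit
    b : (k, d_out) biases
    """
    z = np.einsum('kod,d->ko', W, x) + b  # (k, d_out) piece activations
    return z.max(axis=0)                  # element-wise max over pieces

# Toy usage: a 2-piece maxout layer mapping 10 inputs to 4 outputs.
rng = np.random.default_rng(0)
y = maxout(rng.standard_normal(10),
           0.1 * rng.standard_normal((2, 4, 10)),
           np.zeros((2, 4)))
print(y.shape)  # (4,)
```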
Far-field speech recognition using CNN-DNN-HMM with convolution in time
TLDR
Experimental results show that a CNN coupled with a fully connected DNN can model short-time correlations in feature vectors with fewer parameters than a DNN, and thus generalise better to unseen test environments.
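As a toy illustration of why convolution in time is parameter-efficient, the sketch below reuses one small filter at every time offset of a feature-vector sequence, instead of learning a separate weight for every position of a spliced context window; the filter size and dimensions are arbitrary.

```python
import numpy as np

def conv_in_time(frames, w):
    """Convolution along the time axis of a feature sequence (sketch).

    frames : (T, d) sequence of feature vectors
    w      : (k, d) one filter spanning k consecutive frames; reusing it at
             every offset needs k*d weights, versus T*d for a dense layer
             over the whole spliced context.
    """
    T, _ = frames.shape
    k = w.shape[0]
    return np.array([np.sum(w * frames[t:t + k])
                     for t in range(T - k + 1)])  # (T - k + 1,) activations

print(conv_in_time(np.random.randn(11, 40), np.random.randn(5, 40)).shape)
```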
Simplifying very deep convolutional neural network architectures for robust speech recognition
TLDR
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
On using parameterized multi-channel non-causal Wiener filter-adapted convolutional neural networks for distant speech recognition
TLDR
Experimental results on the TIMIT dataset show that the proposed PMWF-based CNN approach outperforms the cross-channel CNN and the delay-and-sum (DS) beamformer on word error rate (WER) in various DSR environments.
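The DS baseline referred to above is the classic delay-and-sum beamformer; a minimal sketch follows, with integer sample delays assumed to be known (a real front end would estimate them, e.g. with GCC-PHAT).

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Minimal delay-and-sum (DS) beamformer sketch.

    signals : (n_channels, n_samples) microphone signals
    delays  : per-channel integer sample delays steering toward the talker
              (assumed known here, estimated in practice)
    """
    n_ch, n = signals.shape
    out = np.zeros(n)
    for c in range(n_ch):
        out += np.roll(signals[c], -delays[c])  # time-align each channel
    # Averaging adds the speech coherently while averaging down the
    # (less correlated) noise and reverberation.
    return out / n_ch

# Toy usage: two channels, the second lagging by 3 samples.
s = np.sin(np.linspace(0, 20 * np.pi, 1600))
x = np.stack([s, np.roll(s, 3)])
y = delay_and_sum(x, np.array([0, 3]))
```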
Attention-Based LSTM with Multi-Task Learning for Distant Speech Recognition
TLDR
This paper explores the attention mechanism embedded within the long short-term memory (LSTM) based acoustic model for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM).
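The cited paper specifies exactly how attention is embedded in its multi-task model; purely as orientation, here is a generic numpy sketch of softmax attention pooling over LSTM hidden states, with the scoring function reduced to a single learned vector.

```python
import numpy as np

def attention_pool(H, v):
    """Generic attention pooling over LSTM hidden states (sketch).

    H : (T, d) hidden states from an LSTM run over T frames
    v : (d,)  learned scoring vector; real models typically use an extra
              projection and may condition the scores on a query
    """
    scores = H @ v                      # (T,) relevance score per frame
    a = np.exp(scores - scores.max())
    a /= a.sum()                        # softmax attention weights
    return a @ H, a                     # context vector and weights

# Toy usage: 50 frames of 32-dimensional hidden states.
rng = np.random.default_rng(1)
ctx, weights = attention_pool(rng.standard_normal((50, 32)),
                              0.1 * rng.standard_normal(32))
print(ctx.shape, weights.sum())  # (32,) 1.0
```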
Multiresolution CNN for reverberant speech recognition
  • Sunchan Park, Yongwon Jeong, H. S. Kim
  • Computer Science
    2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)
  • 2017
TLDR
This work proposes a multiresolution CNN with two separate streams: one takes wideband features with a wide context window and the other takes narrowband features with a narrow context window, to improve the performance of reverberant speech recognition using CNN acoustic models.
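A minimal sketch of the two-stream input construction: each stream splices a different context window around the current frame, and in the full model each spliced block would feed its own convolutional stream before the streams are merged. The window sizes and band counts below are illustrative, not the paper's.

```python
import numpy as np

def splice(feats, t, ctx):
    """Stack frames t-ctx .. t+ctx into one context window (edge-padded)."""
    T, _ = feats.shape
    idx = np.clip(np.arange(t - ctx, t + ctx + 1), 0, T - 1)
    return feats[idx]  # (2*ctx + 1, d)

def two_stream_input(wideband, narrowband, t, wide_ctx=15, narrow_ctx=5):
    """Build the two inputs of a multiresolution CNN (illustrative sizes)."""
    return splice(wideband, t, wide_ctx), splice(narrowband, t, narrow_ctx)

# Toy usage: 100 frames, 40-band and 20-band filterbanks.
wide, narrow = two_stream_input(np.random.randn(100, 40),
                                np.random.randn(100, 20), t=50)
print(wide.shape, narrow.shape)  # (31, 40) (11, 20)
```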
Neural networks for distant speech recognition
  • S. Renals, P. Swietojanski
  • Computer Science
    2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA)
  • 2014
TLDR
This paper investigates the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout) for distant speech recognition of meetings recorded using microphone arrays, and indicates that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models.

References

Showing 1-10 of 51 references
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of the hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
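The sketch below shows that mechanism in miniature: a 1-D convolution along the frequency axis followed by max-pooling, which makes the output partially invariant to small spectral shifts such as speaker-dependent formant positions. Filter and pooling sizes are illustrative.

```python
import numpy as np

def conv_pool_freq(x, w, pool=4):
    """1-D convolution along frequency followed by max-pooling (sketch).

    x : (n_bands,) one frame of filterbank features
    w : (filt_width,) a single frequency filter
    """
    width = len(w)
    conv = np.array([np.dot(w, x[i:i + width])
                     for i in range(len(x) - width + 1)])
    n = len(conv) // pool
    # Max over non-overlapping regions discards exact band positions,
    # which is what tolerates small spectral shifts between speakers.
    return conv[:n * pool].reshape(n, pool).max(axis=1)

# Pooled outputs for a spectrum and a one-band-shifted copy typically
# agree at most positions, illustrating the partial shift tolerance.
rng = np.random.default_rng(2)
x, w = rng.standard_normal(40), rng.standard_normal(5)
print(conv_pool_freq(x, w))
print(conv_pool_freq(np.roll(x, 1), w))
```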
Exploring convolutional neural network structures and optimization techniques for speech recognition
TLDR
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the pooling size can be learned automatically.
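The exact definition of the weighted softmax pooling layer is given in the paper; one plausible reading, sketched below, weights the activations in a pooling region by a softmax over learned scores, so the layer can interpolate between average-like and max-like pooling. Treat this as an assumed formulation, not the paper's verbatim one.

```python
import numpy as np

def weighted_softmax_pool(z, s):
    """Assumed form of one learnable 'weighted softmax' pooling region.

    z : (p,) activations inside the pooling region
    s : (p,) learned scores; softmax(s) gives the pooling weights, so
        uniform scores yield average pooling and one dominant score
        approaches max pooling (illustrative reading of the paper).
    """
    w = np.exp(s - s.max())
    w /= w.sum()
    return np.dot(w, z)

print(weighted_softmax_pool(np.array([1.0, 3.0, 2.0]),
                            np.array([0.0, 5.0, 0.0])))  # close to max, 3.0
```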
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSS tasks, comparing CNNs toDNNs and GMMs.
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted, and an effective strategy for using dropout during Hessian-free sequence training is introduced.
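To make the limited versus full weight sharing distinction concrete, the sketch below contrasts one filter sliding over the whole frequency axis (full sharing) with separate filters per frequency zone (limited sharing, reflecting that spectral patterns differ between low and high frequencies). Zone and filter sizes are illustrative.

```python
import numpy as np

def conv1d(x, w):
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

def full_vs_limited_weight_sharing(x, w_full, w_zones, zone=10):
    """Full sharing: one filter over all bands.
    Limited sharing: the frequency axis is cut into zones, each with
    its own filter (no sliding across zone boundaries)."""
    full = conv1d(x, w_full)
    limited = np.concatenate([conv1d(x[z:z + zone], w_zones[i])
                              for i, z in enumerate(range(0, len(x), zone))])
    return full, limited

rng = np.random.default_rng(4)
full, limited = full_vs_limited_weight_sharing(
    rng.standard_normal(40), rng.standard_normal(5),
    rng.standard_normal((4, 5)))
print(full.shape, limited.shape)  # (36,) (24,)
```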
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
TLDR
This paper investigates a novel approach where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
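A heavily compressed sketch of the idea: convolve learned filters directly with raw samples, pool, and map to phoneme posteriors with a softmax. The real system stacks several convolutional and MLP layers; everything here (window hop, layer count, sizes) is an illustrative simplification.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def raw_wave_phoneme_probs(wave, w_conv, W_out):
    """Estimate phoneme posteriors straight from raw samples (sketch).

    wave   : (n_samples,) raw speech segment
    w_conv : (n_filters, filt_len) first-layer filters learned on samples
    W_out  : (n_phones, n_filters) toy output layer standing in for the
             deeper stack used in practice
    """
    n_filt, flen = w_conv.shape
    hop = flen  # non-overlapping windows, i.e. a stride-flen convolution
    frames = wave[:len(wave) // hop * hop].reshape(-1, hop)  # (T, flen)
    h = np.maximum(frames @ w_conv.T, 0.0)  # (T, n_filt) conv + ReLU
    pooled = h.max(axis=0)                  # max-pool over time
    return softmax(W_out @ pooled)          # phoneme class posteriors

rng = np.random.default_rng(3)
p = raw_wave_phoneme_probs(rng.standard_normal(1600),
                           0.05 * rng.standard_normal((16, 80)),
                           0.1 * rng.standard_normal((40, 16)))
print(p.shape, p.sum())  # (40,) 1.0
```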
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output, and can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.
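A standard detail of such hybrid systems is how the DNN's senone posteriors enter HMM decoding: by Bayes' rule, p(x|s) = p(s|x) p(x) / p(s), and since p(x) is constant per frame, dividing the posteriors by the senone priors gives "scaled likelihoods" the HMM can use, as in the sketch below.

```python
import numpy as np

def scaled_likelihoods(senone_posteriors, senone_priors, floor=1e-8):
    """Convert DNN senone posteriors p(s|x) into scaled likelihoods
    p(s|x) / p(s) for HMM decoding; priors are typically estimated by
    counting senones in state-level training alignments."""
    return senone_posteriors / np.maximum(senone_priors, floor)

post = np.array([0.7, 0.2, 0.1])
prior = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(post, prior))  # [1.4, 0.667, 0.5]
```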
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM
TLDR
This paper presents a strategy for using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility to use arbitrary input features.
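One simple way to realize mixed-bandwidth training, sketched below, is to place narrowband and wideband utterances in a shared mel filterbank space and fill the upper bands (above 4 kHz) with zeros for narrowband data; the band counts are illustrative, and the cited work also considers treating the upper bands as missing rather than zero.

```python
import numpy as np

def unify_bandwidth(fbank, n_total_bands):
    """Map filterbank features into a shared wideband feature space (sketch).

    Wideband (16 kHz) utterances use all mel bands; narrowband (8 kHz)
    utterances cover only the bands below 4 kHz, so the missing upper
    bands are zero-filled. Both kinds of data can then train one DNN.
    """
    T, n = fbank.shape
    if n == n_total_bands:
        return fbank  # already wideband
    return np.hstack([fbank, np.zeros((T, n_total_bands - n))])

# An 8 kHz utterance with 22 bands mapped into a 29-band wideband space.
print(unify_bandwidth(np.random.randn(100, 22), 29).shape)  # (100, 29)
```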
Convolutional networks for speech detection
TLDR
A convolutional network architecture is proposed that incorporates long- and short-term temporal and spectral correlations of speech in the detection process, addressing many shortcomings of existing speech detectors in a unified new framework.
Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors
TLDR
Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm.