Recurrent Models for Auditory Attention in Multi-Microphone Distant Speech Recognition

@article{Kim2016RecurrentMF,
  title={Recurrent Models for Auditory Attention in Multi-Microphone Distant Speech Recognition},
  author={Suyoun Kim and Ian R. Lane},
  journal={ArXiv},
  year={2016},
  volume={abs/1511.06407}
}
Integration of multiple microphone data is one of the key ways to achieve robust speech recognition in noisy environments or when the speaker is located at some distance from the input device. Signal processing techniques such as beamforming are widely used to extract a speech signal of interest from background noise. These techniques, however, are highly dependent on prior spatial information about the microphones and the environment in which the system is being used. In this work, we present… 
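From the abstract, the paper's direction is to replace geometry-dependent beamforming with a learned attention mechanism over microphone channels. As a rough illustration of that general idea, here is a minimal sketch in PyTorch; the class name, the LSTM-based scorer, and all dimensions are illustrative assumptions, not the authors' published architecture:

```python
# Minimal, hypothetical sketch of attention over microphone channels.
# Shapes and the LSTM scorer are illustrative, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Scores each channel per frame with a shared recurrent scorer and
    combines channels as a softmax-weighted sum of their features."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.scorer = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, feat_dim), e.g. log-mel features per microphone
        b, c, t, f = x.shape
        h, _ = self.scorer(x.reshape(b * c, t, f))   # shared scorer over every channel
        scores = self.proj(h).reshape(b, c, t, 1)    # one scalar score per channel and frame
        weights = F.softmax(scores, dim=1)           # normalize across channels
        return (weights * x).sum(dim=1)              # (batch, time, feat_dim)
```

The combined features could then feed a standard acoustic model; because the weights are recomputed at every frame, the combiner can shift toward whichever microphone is most reliable at each moment, without prior spatial information about the array.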


Stream Attention for Distributed Multi-Microphone Speech Recognition
TLDR
This paper investigates ASR performance measures used by the proposed stream attention system on the real recorded datasets Mixer-6 and DIRHA-WSJ, and shows that the proposed framework yields substantial improvements in word error rate (WER) compared to conventional strategies.
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
TLDR
New acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly are developed and incorporated into the acoustic model.
End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition
TLDR
This paper proposes introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources.
Attention-Based LSTM with Multi-Task Learning for Distant Speech Recognition
TLDR
This paper explores the attention mechanism embedded within the long short-term memory (LSTM) based acoustic model for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM).
Multi-geometry Spatial Acoustic Modeling for Distant Speech Recognition
TLDR
This work proposes to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input and demonstrates the effectiveness of such MC neural networks through ASR experiments on the real-world far-field data.
Multi-channel Attention for End-to-End Speech Recognition
TLDR
This work proposes a sensory attention mechanism that is invariant to the channel ordering and only increases the overall parameter count by 0.09%, and demonstrates that even without re-training, this attention-equipped end-to-end model is able to deal with arbitrary numbers of input channels during inference.
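One reason such invariance is natural here: with softmax-style channel attention, permuting the input channels permutes the per-channel weights identically, so the weighted sum is unchanged, and the softmax normalizes over however many channels are present. A hypothetical usage of the ChannelAttention sketch from above:

```python
import torch

# Uses the hypothetical ChannelAttention sketch shown earlier; the softmax
# over the channel axis lets the same weights serve arrays of any size.
att = ChannelAttention(feat_dim=40)
out_2ch = att(torch.randn(1, 2, 100, 40))  # 2-mic array -> (1, 100, 40)
out_8ch = att(torch.randn(1, 8, 100, 40))  # 8-mic array -> (1, 100, 40)
```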
Robust Multi-Channel Speech Recognition Using Frequency Aligned Network
TLDR
This paper uses a frequency aligned network for robust multi-channel automatic speech recognition (ASR) and shows that this modification not only reduces the number of parameters in the model but also significantly improves ASR performance.
Stream Attention for far-field multi-microphone ASR
TLDR
A stream attention framework has been applied to the posterior probabilities of the deep neural network to improve the far-field automatic speech recognition (ASR) performance in the multi-microphone configuration and has yielded substantial improvements in word error rate (WER).
Learning representations for speech recognition using artificial neural networks
TLDR
This thesis proposes an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states.

References

Showing 1-10 of 48 references
Deep beamforming networks for multi-channel speech recognition
TLDR
This work proposes to represent the stages of acoustic processing including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network that obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
Neural networks for distant speech recognition
  S. Renals and P. Swietojanski. 2014 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014.
TLDR
This paper investigates the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout) for distant speech recognition of meetings recorded using microphone arrays, and indicates that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models.
Likelihood-maximizing beamforming for robust hands-free speech recognition
TLDR
A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
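The criterion summarized here can be stated compactly; the notation below is an illustrative paraphrase of the likelihood-maximizing idea, not the paper's own symbols. Array processing parameters xi are chosen to maximize the acoustic likelihood of the correct transcription W under the recognizer's acoustic model Lambda, rather than a waveform-level criterion such as SNR:

```latex
\hat{\xi} \;=\; \arg\max_{\xi} \; p\bigl( \mathbf{X}(\xi) \mid W, \Lambda \bigr)
```

where X(xi) is the feature sequence extracted from the array output under parameters xi.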
Using neural network front-ends on far field multiple microphones based speech recognition
TLDR
Results presented in this paper indicate that channel concatenation gives similar or better results than beamforming, and that augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system and yields additional improvements for far field speech recognition.
Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms
TLDR
This paper presents an algorithm to do multichannel enhancement jointly with the acoustic model, using a raw waveform convolutional LSTM deep neural network (CLDNN), and shows that training such a network on inputs captured using multiple (linear) array configurations results in a model that is robust to a range of microphone spacings.
Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition
TLDR
A feature transformation for removing reverberation and background noise artefacts from bottleneck features is proposed, using a DNN trained to learn the mapping between distant-talking speech features and close-talking speech bottleneck features.
Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors
TLDR
Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm.
Distant speech separation using predicted time-frequency masks from spatial features
Convolutional Neural Networks for Distant Speech Recognition
TLDR
This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
Far-field speech recognition using CNN-DNN-HMM with convolution in time
TLDR
Experimental results show that a CNN coupled with a fully connected DNN can model short time correlations in feature vectors with fewer parameters than a DNN and thus generalise better to unseen test environments.