Corpus ID: 239616154

TADRN: Triple-Attentive Dual-Recurrent Network for Ad-hoc Array Multichannel Speech Enhancement

Ashutosh Pandey, Buye Xu, Anurag Kumar, Jacob Donley, Paul T. Calamia, Deliang Wang
Deep neural networks (DNNs) have been used successfully for multichannel speech enhancement with fixed array geometries. However, ad-hoc arrays with unknown microphone placements remain challenging. We propose a DNN-based approach for ad-hoc array processing: the Triple-Attentive Dual-Recurrent Network (TADRN). TADRN uses self-attention across channels to learn spatial information and a dual-path attentive recurrent network (ARN) for temporal modeling. Temporal modeling is done… 
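The cross-channel self-attention described above can be sketched as scaled dot-product attention applied over the microphone axis, so that each channel attends to every other channel at the same time frame. This is a minimal NumPy illustration under assumed shapes and shared projection matrices, not the authors' implementation:

```python
import numpy as np

def channel_self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention across microphone channels.

    x: (channels, frames, features) multichannel feature tensor.
    Attention is computed independently per time frame, so each
    channel attends to all channels at the same frame.
    """
    c, t, f = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (c, t, d) each
    d = q.shape[-1]
    # move frames to the front so attention mixes channels, not time
    q, k, v = (a.transpose(1, 0, 2) for a in (q, k, v))   # (t, c, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (t, c, c)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over channels
    out = attn @ v                                        # (t, c, d)
    return out.transpose(1, 0, 2)                         # back to (c, t, d)
```

Because the attention weights are computed from the data rather than from a fixed geometry, the same module applies to any number of channels in any arrangement, which is what makes it suitable for ad-hoc arrays.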


Multichannel Speech Enhancement without Beamforming
This work proposes a two-stage strategy for multi-channel speech enhancement that obtains additional performance without a beamformer, along with a novel attentive dense convolutional network (ADCN) for predicting the real and imaginary parts of the complex spectrogram.


Channel-Attention Dense U-Net for Multichannel Speech Enhancement
This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively to the feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Deep Ad-hoc Beamforming
Results on speech enhancement tasks show that the proposed deep ad-hoc beamforming framework outperforms its counterpart operating on linear microphone arrays by a considerable margin in both diffuse-noise and point-source-noise reverberant environments.
Continuous Speech Separation with Ad Hoc Microphone Arrays
Experimental results on AdHoc-LibriCSS, a new dataset consisting of continuous recordings of concatenated LibriSpeech utterances captured by multiple different devices, show that the proposed separation method significantly improves ASR accuracy on overlapped speech with little performance degradation on single-talker segments.
Neural Speech Separation Using Spatially Distributed Microphones
Speech recognition experiments show that the proposed neural-network-based speech separation method significantly outperforms baseline multi-channel speech separation systems.
Multi-Channel Speech Enhancement Using Graph Neural Networks
This paper views each audio channel as a node in a non-Euclidean space, specifically a graph, which allows graph neural networks (GNNs) to be applied to find spatial correlations among the different channels (nodes).
End-to-End Multi-Channel Speech Separation
This paper proposes a new end-to-end model for multi-channel speech separation that reformulates the traditional short-time Fourier transform and inter-channel phase difference as functions of a time-domain convolution with a special kernel.
End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation
This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel-permutation- and channel-number-invariant multi-channel speech separation based on the filter-and-sum network, and shows that TAC significantly improves separation performance across varying numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays.
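
The TAC pattern can be sketched in a few lines: transform each channel independently, average the transformed channels, then concatenate the (transformed) average back onto each channel and transform again. This NumPy sketch uses ReLU in place of the paper's parametric activations and assumed weight shapes; it is an illustration of the design pattern, not the published model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tac(x, w1, w2, w3):
    """Transform-average-concatenate over an ad-hoc set of channels.

    x: (channels, features). Works for any number of channels and is
    equivariant to their ordering, because the only cross-channel
    operation is an average.
    """
    z = relu(x @ w1)                                   # transform each channel
    m = relu(z.mean(axis=0, keepdims=True) @ w2)       # average, then transform
    m = np.repeat(m, x.shape[0], axis=0)               # broadcast back to channels
    return relu(np.concatenate([z, m], axis=-1) @ w3)  # concatenate + transform
```

The averaging step is what buys permutation and channel-count invariance: permuting the input channels permutes the output rows identically, and the mean is defined for any number of channels.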
Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation
This study tightly integrates complementary spectral and spatial features for deep-learning-based multi-channel speaker separation in reverberant environments. The key idea is to localize individual…
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
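
Inter-microphone phase features of this kind are straightforward to compute from the multichannel STFT: take the phase difference of each channel against a reference channel and encode it with cosine/sine to avoid phase-wrapping discontinuities. A minimal NumPy sketch, with the cos/sin encoding being one common choice rather than necessarily the paper's exact feature:

```python
import numpy as np

def ipd_features(stfts, ref=0):
    """Inter-microphone phase differences relative to a reference channel.

    stfts: complex array (channels, frames, freq_bins).
    Returns cos/sin-encoded IPDs, shape (channels - 1, frames, 2 * freq_bins).
    """
    phase = np.angle(stfts)
    ipd = np.delete(phase, ref, axis=0) - phase[ref]   # drop ref, subtract its phase
    return np.concatenate([np.cos(ipd), np.sin(ipd)], axis=-1)
```

These features are appended to the per-channel spectral inputs; because they depend only on relative phase, they carry spatial cues even when the array geometry is random.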
Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks
It is shown that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
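
The mask-based MVDR pipeline referenced above follows a standard recipe: use the predicted mask to estimate speech and noise spatial covariance matrices, then form the MVDR weights from them. This is a minimal single-frequency-bin NumPy sketch of that generic recipe (with assumed diagonal loading for invertibility), not the exact system evaluated in the paper:

```python
import numpy as np

def mvdr_weights(y, mask, ref=0, eps=1e-6):
    """Mask-based MVDR beamformer weights for one frequency bin.

    y: (frames, channels) complex STFT vectors; mask: (frames,) in [0, 1],
    shared across microphones as in the single-mask variant.
    """
    def cov(m):
        # weighted spatial covariance: sum_t m_t * y_t y_t^H / sum_t m_t
        return (y.T * m) @ y.conj() / max(m.sum(), eps)

    phi_s = cov(mask)                      # speech covariance
    phi_n = cov(1.0 - mask)                # noise covariance
    phi_n += eps * np.eye(y.shape[1])      # diagonal loading for invertibility
    num = np.linalg.solve(phi_n, phi_s)    # Phi_n^{-1} Phi_s
    w = num[:, ref] / (np.trace(num) + eps)
    return w                               # apply per frame as w.conj() @ y_t
```

For a point source with steering vector d, these weights satisfy the distortionless property w^H d = d[ref] (up to the loading term), while minimizing the output noise power.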