Corpus ID: 211032118

Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features

@article{Fan2020SpatialAS,
  title={Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features},
  author={Cunhang Fan and B. Liu and Jianhua Tao and Jiangyan Yi and Zhengqi Wen},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.01626}
}
Multi-channel deep clustering (MDC) has achieved good performance for speech separation. However, MDC only applies the spatial features as additional information, so it is difficult to learn the mutual relationship between spatial and spectral features. Besides, the training objective of MDC is defined on the embedding vectors rather than on the real separated sources, which may damage the separation performance. In this work, we propose a deep attention fusion method to dynamically control the weights…
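As a rough illustration of the idea sketched in the abstract, the snippet below shows one way an attention module could dynamically weight spectral and spatial feature streams per frame. The module name, dimensions, and wiring are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch: a learned attention weight decides, per frame,
    how much the spectral vs. the spatial stream contributes to the fused
    feature (not the paper's exact architecture)."""
    def __init__(self, spectral_dim: int, spatial_dim: int, fused_dim: int):
        super().__init__()
        self.proj_spec = nn.Linear(spectral_dim, fused_dim)
        self.proj_spat = nn.Linear(spatial_dim, fused_dim)
        # Two scalar attention logits per frame, one per stream.
        self.attn = nn.Linear(2 * fused_dim, 2)

    def forward(self, spec: torch.Tensor, spat: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, spectral_dim); spat: (batch, frames, spatial_dim)
        hs = torch.tanh(self.proj_spec(spec))
        hp = torch.tanh(self.proj_spat(spat))
        w = torch.softmax(self.attn(torch.cat([hs, hp], dim=-1)), dim=-1)
        # Input-dependent weights, so the spectral/spatial balance is dynamic.
        return w[..., 0:1] * hs + w[..., 1:2] * hp
```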

Citations

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations
TLDR
The gated recurrent fusion (GRF) method is proposed to adaptively select and fuse the relevant information from spectral and spatial features by making use of gate and memory modules, and to solve the training-objective problem of MDC.
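A rough sketch of what a GRU-style gated fusion of two feature streams could look like, in the spirit of the GRF described above; the gate equations here are modeled on a standard GRU cell and are an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GatedRecurrentFusion(nn.Module):
    """Illustrative GRU-style fusion of two streams: reset and update
    gates control how much of each stream enters the fused output."""
    def __init__(self, dim: int):
        super().__init__()
        self.reset = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, frames, dim) spectral / spatial streams
        ab = torch.cat([a, b], dim=-1)
        r = torch.sigmoid(self.reset(ab))    # how much of b to expose
        z = torch.sigmoid(self.update(ab))   # mixing coefficient
        h = torch.tanh(self.candidate(torch.cat([a, r * b], dim=-1)))
        return (1.0 - z) * a + z * h         # fused representation
```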
Simultaneous Denoising and Dereverberation Using Deep Embedding Features
TLDR
A joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on deep clustering (DC).
Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
TLDR
Experimental results on the WSJ0-2mix dataset show that the proposed end-to-end post-filter method with deep attention fusion features outperforms the state-of-the-art speech separation method.
Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations
TLDR
A joint training method for simultaneous speech denoising and dereverberation using deep embedding representations that outperforms the WPE and BLSTM baselines and allows both stages to be optimized simultaneously.
End-to-End Post-Filter for Speech Separation With Deep Attention Fusion Features
TLDR
Experimental results on the WSJ0-2mix dataset show that the proposed end-to-end post-filter method with deep attention fusion features outperforms the state-of-the-art speech separation method.
Implicit Filter-and-sum Network for Multi-channel Speech Separation
TLDR
The proposed modification to the FaSNet, referred to as iFaSNet, is able to significantly outperform the benchmark FaSNet across all conditions with comparable model complexity.
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition
TLDR
A gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end automatic speech recognition (ASR) that achieves better performance with a 12.67% CER reduction, which suggests the potential of the proposed method.

References

SHOWING 1-10 OF 21 REFERENCES
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
TLDR
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
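As a concrete illustration of the inter-microphone phase features mentioned in this TLDR, the sketch below computes a cosine/sine encoding of the phase difference for one microphone pair; the exact feature encoding used in the paper may differ.

```python
import numpy as np

def ipd_features(stft_ref: np.ndarray, stft_mic: np.ndarray) -> np.ndarray:
    """Inter-microphone phase-difference features for one microphone pair.
    stft_ref, stft_mic: complex STFTs of shape (frames, freq_bins).
    The cos/sin encoding (a common choice, assumed here) sidesteps
    the 2*pi wrapping of raw phase differences."""
    phase_diff = np.angle(stft_mic) - np.angle(stft_ref)
    return np.stack([np.cos(phase_diff), np.sin(phase_diff)], axis=-1)
```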
Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation
This study tightly integrates complementary spectral and spatial features for deep learning based multi-channel speaker separation in reverberant environments. The key idea is to localize individual speakers…
Spatial Constraint on Multi-channel Deep Clustering
  • M. Togami · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TLDR
Experimental results show that multi-channel deep clustering with the proposed input feature based on the estimated direction-of-arrival (DOA) can separate speech sources better than conventional multi-channel deep clustering, which stacks the embeddings of all microphone pairs.
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well on three-speaker mixtures.
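The deep clustering objective itself is compact enough to state in code: it pushes the embedding affinity matrix VVᵀ toward the ideal assignment affinity YYᵀ. The sketch below uses the standard low-rank expansion of the Frobenius norm so the large (TF × TF) affinity matrices are never formed; the tensor shapes are assumptions.

```python
import torch

def deep_clustering_loss(V: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """||V V^T - Y Y^T||_F^2 computed via the identity
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2.
    V: (batch, TF, D) unit-norm embeddings;
    Y: (batch, TF, C) one-hot ideal-assignment labels."""
    vtv = torch.bmm(V.transpose(1, 2), V)  # (batch, D, D)
    vty = torch.bmm(V.transpose(1, 2), Y)  # (batch, D, C)
    yty = torch.bmm(Y.transpose(1, 2), Y)  # (batch, C, C)
    return (vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()) / V.shape[0]
```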
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR
This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features
TLDR
The proposed models maximize the distance between permutations, which is applied to fine-tune the whole model, and achieve better performance than DC and uPIT for speaker-independent speech separation.
Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation
This paper tightly integrates spectral and spatial information for deep learning based multi-channel speaker separation. The key idea is to localize individual speakers so that an enhancement network…
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
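The PIT criterion summarized here fits in a few lines: evaluate the separation loss under every output-to-reference permutation and train on the smallest one. The naive O(C!) sketch below assumes a per-source loss such as MSE; real systems may use utterance-level variants or assignment solvers when there are many speakers.

```python
import itertools
import torch

def pit_loss(estimates: torch.Tensor, references: torch.Tensor,
             loss_fn=torch.nn.functional.mse_loss) -> torch.Tensor:
    """Permutation invariant training loss for one utterance.
    estimates, references: (speakers, ...) tensors."""
    n = estimates.shape[0]
    best = None
    for perm in itertools.permutations(range(n)):
        # Average per-source loss under this output-to-reference pairing.
        loss = sum(loss_fn(estimates[i], references[j])
                   for i, j in enumerate(perm)) / n
        if best is None or loss < best:
            best = loss
    return best
```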
Utterance-level Permutation Invariant Training with Discriminative Learning for Single Channel Speech Separation
TLDR
A uPIT with discriminative learning (uPITDL) method to solve the problem of speaker-independent speech separation by adding a regularization term to the cost function, which minimizes the difference between the outputs of the model and their corresponding reference signals.
Supervised Speech Separation Based on Deep Learning: An Overview
TLDR
This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years, together with a historical perspective on how advances have been made.