Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features
@article{Fan2020SpatialAS,
  title={Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features},
  author={Cunhang Fan and B. Liu and Jianhua Tao and Jiangyan Yi and Zhengqi Wen},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.01626}
}
Multi-channel deep clustering (MDC) has achieved good performance for speech separation. However, MDC only applies the spatial features as additional information, so it is difficult to learn the mutual relationship between spatial and spectral features. Besides, the training objective of MDC is defined on embedding vectors rather than on the real separated sources, which may harm separation performance. In this work, we propose a deep attention fusion method to dynamically control the weights…
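The idea of dynamically weighting spectral and spatial streams can be sketched roughly as follows. This is a minimal NumPy illustration of per-frame attention fusion, not the authors' network; the function name `attention_fuse` and the parameters `w` and `b` are hypothetical stand-ins for a learned scoring layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(spectral, spatial, w, b):
    """Fuse spectral and spatial features with frame-wise attention weights.

    spectral, spatial: (T, D) feature matrices (T frames, D dims).
    w: (2*D, 2) and b: (2,) are hypothetical scoring parameters that
    produce one score per stream per frame.
    """
    concat = np.concatenate([spectral, spatial], axis=-1)  # (T, 2D)
    scores = concat @ w + b                                # (T, 2)
    alpha = softmax(scores, axis=-1)                       # weights sum to 1 per frame
    fused = alpha[:, :1] * spectral + alpha[:, 1:] * spatial
    return fused, alpha
```

Because the weights are recomputed per frame from both streams, frames where spatial cues are unreliable (e.g. close speaker directions) can lean on the spectral stream, and vice versa.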
7 Citations
Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations
- Computer Science · INTERSPEECH · 2020
The gated recurrent fusion (GRF) method is proposed to adaptively select and fuse the relevant information from spectral and spatial features by making use of the gate and memory modules to solve the training objective problem of MDC.
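A gated selection between the two feature streams can be illustrated with a much simpler element-wise gate (a hedged sketch only; the GRF method uses full gate-and-memory modules, which this toy `gated_fuse` with hypothetical parameters `wg`, `bg` does not reproduce):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(spectral, spatial, wg, bg):
    """Element-wise gate deciding, per feature dimension, how much of the
    spectral vs. spatial stream to pass through.

    spectral, spatial: (T, D); wg: (2*D, D) and bg: (D,) are hypothetical
    gate parameters.
    """
    g = sigmoid(np.concatenate([spectral, spatial], axis=-1) @ wg + bg)
    return g * spectral + (1.0 - g) * spatial  # convex combination per element
```

Since the gate output lies in (0, 1), each fused value is a convex combination of the two streams, so the fusion can never stray outside the range the inputs span.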
Simultaneous Denoising and Dereverberation Using Deep Embedding Features
- Computer Science · ArXiv · 2020

A joint training method for simultaneous speech denoising and dereverberation using deep embedding features, based on deep clustering (DC).
Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
- Computer Science · ArXiv · 2020
Experimental results on the WSJ0-2mix dataset show that the proposed end-to-end post-filter method with deep attention fusion features outperforms the state-of-the-art speech separation method.
Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations
- Computer Science · INTERSPEECH · 2020

A joint training method for simultaneous speech denoising and dereverberation using deep embedding representations that outperforms the WPE and BLSTM baselines and allows both stages to be optimized simultaneously.
End-to-End Post-Filter for Speech Separation With Deep Attention Fusion Features
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing · 2020
Experimental results on the WSJ0-2mix dataset show that the proposed end-to-end post-filter method with deep attention fusion features outperforms the state-of-the-art speech separation method.
Implicit Filter-and-sum Network for Multi-channel Speech Separation
- Computer Science · ArXiv · 2020

The proposed modification to the FaSNet, referred to as iFaSNet, significantly outperforms the benchmark FaSNet across all conditions with on-par model complexity.
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition
- Computer Science, Engineering · IEEE/ACM Transactions on Audio, Speech, and Language Processing · 2021

A gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end automatic speech recognition (ASR) that achieves better performance with a 12.67% CER reduction, which suggests the potential of the proposed method.
References
SHOWING 1-10 OF 21 REFERENCES
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
- Physics · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2018
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation
- Physics · IEEE/ACM Transactions on Audio, Speech, and Language Processing · 2019
This study tightly integrates complementary spectral and spatial features for deep learning based multi-channel speaker separation in reverberant environments. The key idea is to localize individual…
Spatial Constraint on Multi-channel Deep Clustering
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2019

Experimental results show that multi-channel deep clustering with the proposed input feature based on the estimated direction-of-arrival (DOA) can separate speech sources better than conventional multi-channel deep clustering that stacks embeddings of all the pairs of microphones.
Deep clustering: Discriminative embeddings for segmentation and separation
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2016

Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.
Single-Channel Multi-Speaker Separation Using Deep Clustering
- Computer Science · INTERSPEECH · 2016

This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features
- Computer Science · INTERSPEECH · 2019

The proposed models maximize the distance between permutations, which is applied to fine-tune the whole model, and achieve better performance than DC and uPIT for speaker-independent speech separation.
Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation
- Physics · INTERSPEECH · 2018
This paper tightly integrates spectral and spatial information for deep learning based multi-channel speaker separation. The key idea is to localize individual speakers so that an enhancement network…
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2017

This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
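The PIT criterion itself is simple to state in code: evaluate the training loss under every assignment of model outputs to reference speakers and keep the minimum. A minimal NumPy sketch with an assumed MSE loss (the helper name `pit_mse_loss` is illustrative, not from the paper):

```python
import numpy as np
from itertools import permutations

def pit_mse_loss(estimates, references):
    """Permutation-invariant MSE loss.

    estimates, references: (S, T) arrays holding S separated / target
    signals of length T. Returns the minimum mean-squared error over all
    S! output-to-speaker assignments, plus the best permutation.
    """
    n_speakers = references.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(n_speakers)):
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the loss is invariant to output ordering, the network is free to emit the speakers in any order; the exhaustive search is cheap for the two- or three-speaker mixtures typical of these benchmarks.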
Utterance-level Permutation Invariant Training with Discriminative Learning for Single Channel Speech Separation
- Computer Science · 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) · 2018

A uPIT with discriminative learning (uPITDL) method to solve the speaker-independent speech separation problem by adding a regularization term to the cost function that minimizes the difference between the outputs of the model and their corresponding reference signals.
Supervised Speech Separation Based on Deep Learning: An Overview
- Physics · IEEE/ACM Transactions on Audio, Speech, and Language Processing · 2018
This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years, and provides a historical perspective on how advances are made.