Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation

@inproceedings{Wang2018IntegratingSA,
  title={Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation},
  author={Zhong-Qiu Wang and DeLiang Wang},
  booktitle={INTERSPEECH},
  year={2018}
}
This paper tightly integrates spectral and spatial information for deep learning-based multi-channel speaker separation. The key idea is to localize individual speakers so that an enhancement network can be used to separate the speaker from an estimated direction and with specific spectral characteristics. To determine the direction of the speaker of interest, we identify time-frequency (T-F) units dominated by that speaker and use only those units for direction of arrival (DOA) estimation. The…
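The mask-guided localization step lends itself to a compact illustration. Below is a minimal sketch, assuming a two-microphone array, STFT-domain processing, and an externally supplied speaker-dominance mask (in the paper the mask comes from a neural network); the candidate-direction grid, microphone spacing, and scoring function are illustrative choices, not the paper's exact configuration.

import numpy as np
from scipy.signal import stft

def estimate_doa(x_mics, mask, fs=16000, mic_dist=0.05, c=343.0, n_fft=512):
    """Estimate the DOA of one speaker from a 2-mic mixture, using only
    T-F units that `mask` marks as dominated by that speaker.

    x_mics: (2, n_samples) time-domain signals.
    mask:   (n_fft // 2 + 1, n_frames) dominance weights in [0, 1]
            (estimated by a neural network in the paper; an input here).
    """
    _, _, X0 = stft(x_mics[0], fs=fs, nperseg=n_fft)
    _, _, X1 = stft(x_mics[1], fs=fs, nperseg=n_fft)
    ipd = np.angle(X1 * np.conj(X0))               # observed inter-mic phase
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)       # Hz per frequency bin

    candidates = np.deg2rad(np.arange(0, 181, 5))  # candidate directions
    scores = []
    for theta in candidates:
        tau = mic_dist * np.cos(theta) / c             # hypothesized delay
        expected = 2 * np.pi * freqs[:, None] * tau    # phase per bin
        # Mask-weighted agreement between observed and hypothesized phase:
        # only speaker-dominated T-F units contribute to the score.
        scores.append(np.sum(mask * np.cos(ipd - expected)))
    return np.rad2deg(candidates[int(np.argmax(scores))])

Because low-mask units are down-weighted, the interfering speaker's T-F units contribute little to the score, which is what makes the DOA estimate robust in overlapped speech.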

Citations

Direction-Aware Speaker Beam for Multi-Channel Speaker Extraction
TLDR
The proposed scheme largely improves the existing multi-channel SpeakerBeam in low signal-to-interference-ratio or same-gender scenarios and tightly integrates spectral and spatial information for target speaker extraction.
Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information
TLDR
This paper integrates an attention mechanism that dynamically tunes the model's attention to reliable input features, alleviating the spatial ambiguity problem when multiple speakers are closely located, and significantly improves separation performance over baseline single-channel and multi-channel speech separation methods.
Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments
TLDR
It is shown that rank-1 approximation of a speech covariance matrix based on generalized eigenvalue decomposition leads to the best results for the masking-based MVDR beamformer.
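As a rough sketch of that rank-1 GEVD idea (not necessarily the paper's exact recipe): given mask-weighted speech and noise spatial covariance matrices for one frequency bin, the principal generalized eigenpair yields a rank-1 speech covariance from which MVDR weights follow. The helper below assumes numpy/scipy and reference microphone 0.

import numpy as np
from scipy.linalg import eigh

def rank1_gevd_mvdr(phi_s, phi_n):
    """MVDR weights from a rank-1 GEVD approximation of the speech
    covariance. phi_s, phi_n: (M, M) speech / noise spatial covariance
    matrices for one frequency bin (typically mask-weighted estimates)."""
    # Principal generalized eigenpair of the pencil (phi_s, phi_n);
    # scipy returns eigenvalues in ascending order.
    lam, V = eigh(phi_s, phi_n)
    q = V[:, -1]
    # Rank-1 reconstruction: phi_s ~= lam_max * phi_n q q^H phi_n
    # (valid because eigh normalizes q^H phi_n q = 1).
    phi_s_r1 = lam[-1] * (phi_n @ np.outer(q, q.conj()) @ phi_n)
    steer = phi_s_r1[:, 0] / (phi_s_r1[0, 0] + 1e-10)  # ref mic 0
    # Standard MVDR: w = phi_n^{-1} d / (d^H phi_n^{-1} d).
    num = np.linalg.solve(phi_n, steer)
    return num / (steer.conj() @ num + 1e-10)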
Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement
TLDR
Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly improve x-vector performance, and that the covariance matrix estimate is effective for the MVDR beamformer.
Single-Channel Multi-Speaker Speech Separation Based on Quantized Ratio Mask and Residual Network
TLDR
A network framework that combines a residual network, a recurrent network, and a fully connected network is used to exploit correlation information across frequency, and shows a 1.6 dB SDR improvement over previous state-of-the-art methods.
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation
TLDR
A temporal-spatial neural filter is proposed, which directly estimates the target speech waveform from a multi-speaker mixture in reverberant environments, assisted by directional information of the speaker(s), together with a fully convolutional autoencoder framework that is purely end-to-end and single-pass.
Speaker and Direction Inferred Dual-Channel Speech Separation
TLDR
This work proposes a speaker and direction inferred speech separation network (dubbed SDNet) to solve the cocktail party problem; it generates more precise perceptual representations with the help of spatial features and successfully handles an unknown number of sources and the selection of outputs.
Efficient Integration of Multi-channel Information for Speaker-independent Speech Separation
TLDR
It was found that the proposed methods outperform multi-channel deep clustering and improve performance proportionally to the number of microphones, and that the late-fusion method consistently outperforms the single-channel method regardless of the angle difference between speakers.
Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation
  • Lianwu Chen, Meng Yu, Dan Su, Dong Yu
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This work reviews the most recent models of multi-channel permutation invariant training (PIT), investigates spatial features formed by microphone pairs and their underlying impact and issues, presents a multi-band architecture for effective feature encoding, and integrates single-channel and multi-channel PIT models to resolve the spatial overlapping problem in the conventional multi-channel PIT framework.
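The PIT criterion at the heart of these models is simple to state in code: evaluate the training loss under every speaker-to-output assignment and keep only the best one. A framework-agnostic numpy sketch, with mean-squared error as an illustrative criterion:

import numpy as np
from itertools import permutations

def pit_mse(estimates, targets):
    """Permutation invariant MSE over all speaker-to-output assignments.

    estimates, targets: (n_sources, n_frames, n_freq) arrays.
    Returns (min_loss, best_permutation)."""
    n = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(n)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm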
Improvement of Spatial Ambiguity in Multi-Channel Speech Separation Using Channel Attention
TLDR
This study proposes an attention mechanism for the Temporal-Spatial Neural Filter (TSNF), applying channel attention to the merged features and to the feature map of the 1D convolution block in the temporal convolutional network.
...

References

SHOWING 1-10 OF 34 REFERENCES
Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation
TLDR
It is found that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry.
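The feature encoding described here can be as simple as appending the cosine and sine of inter-microphone phase differences (IPDs) to the usual log-magnitude input. A sketch, assuming precomputed STFTs and making no claim about the paper's exact normalization:

import numpy as np

def dc_input_features(stfts, ref_mic=0):
    """Stack log-magnitude of a reference channel with cos/sin of
    inter-microphone phase differences (IPDs) against that channel.

    stfts: (n_mics, n_freq, n_frames) complex STFTs.
    Returns (n_freq, n_frames, n_features) real-valued features."""
    ref = stfts[ref_mic]
    feats = [np.log(np.abs(ref) + 1e-8)]
    for m in range(stfts.shape[0]):
        if m == ref_mic:
            continue
        ipd = np.angle(stfts[m]) - np.angle(ref)
        # cos/sin encoding avoids the 2*pi wrapping discontinuity.
        feats.extend([np.cos(ipd), np.sin(ipd)])
    return np.stack(feats, axis=-1)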
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR
This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Multichannel Spatial Clustering Using Model-Based Source Separation
TLDR
This chapter will discuss several approaches to unsupervised spatial clustering, with a focus on model-based expectation maximization source separation and localization (MESSL), and the basic two-microphone version of this model, which clusters spectrogram points based on the relative differences in phase and level between pairs of microphones.
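In place of MESSL's full EM over frequency-dependent interaural phase and level models, a heavily simplified stand-in clusters T-F units directly on per-unit delay and level-difference cues; the sketch below uses plain k-means and is only meant to show which cues are being modeled, not to reproduce MESSL.

import numpy as np
from scipy.cluster.vq import kmeans2

def spatial_cluster(X0, X1, n_src=2, fs=16000, n_fft=512):
    """Cluster T-F units on delay and level-difference cues and return
    binary masks, one per source.

    X0, X1: (n_fft // 2 + 1, n_frames) complex STFTs of two microphones."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    freqs[0] = freqs[1]                 # avoid divide-by-zero at DC
    ipd = np.angle(X1 * np.conj(X0))
    # Narrowband per-unit delay estimate (spatially aliased at high
    # frequencies, which MESSL's probabilistic model handles properly).
    delay = ipd / (2 * np.pi * freqs[:, None])
    ild = 20 * np.log10((np.abs(X1) + 1e-8) / (np.abs(X0) + 1e-8))
    cues = np.stack([delay.ravel(), ild.ravel()], axis=1)
    cues = (cues - cues.mean(axis=0)) / (cues.std(axis=0) + 1e-8)  # z-score
    _, labels = kmeans2(cues, n_src, minit='++', seed=0)
    return [(labels == k).reshape(X0.shape) for k in range(n_src)]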
Deep Clustering-Based Beamforming for Separation with Unknown Number of Sources
TLDR
This paper extends a deep clustering algorithm for use with time-frequency masking-based beamforming, performs separation with an unknown number of sources, and achieves source separation performance comparable to that obtained with a complex Gaussian mixture model-based beamformer.
A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR
TLDR
The core of the algorithm estimates a time-frequency mask that represents the target speech, uses masking-based beamforming to enhance the corrupted speech, and proposes a masking-based post-filter to further suppress the noise in the beamforming output.
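A skeleton of that iteration, with the single-channel mask estimator injected as a callable (a DNN in the paper) and a generic mask-weighted MVDR beamformer as a stand-in; details such as the diagonal loading and steering-vector choice here are illustrative assumptions, not the paper's exact design.

import numpy as np

def iterate_enhancement(stfts, estimate_mask, n_iters=2):
    """Alternate single-channel mask estimation and multi-channel
    beamforming, finishing each pass with a mask-based post-filter.

    stfts: (n_mics, n_freq, n_frames) complex STFTs.
    estimate_mask: callable mapping a (n_freq, n_frames) magnitude
        spectrogram to a T-F mask in [0, 1] (a DNN in the paper)."""
    M, F, T = stfts.shape
    enhanced = stfts[0]                            # start from reference mic
    for _ in range(n_iters):
        mask = estimate_mask(np.abs(enhanced))     # single-channel step
        # Mask-weighted spatial covariances, per frequency bin.
        phi_s = np.einsum('ft,mft,nft->fmn', mask, stfts, stfts.conj())
        phi_n = np.einsum('ft,mft,nft->fmn', 1.0 - mask, stfts, stfts.conj())
        out = np.empty((F, T), dtype=complex)
        for f in range(F):
            steer = np.linalg.eigh(phi_s[f])[1][:, -1]   # principal direction
            w = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(M), steer)
            w = w / (steer.conj() @ w + 1e-8)            # MVDR normalization
            out[f] = w.conj() @ stfts[:, f, :]           # multi-channel step
        enhanced = out * mask                            # mask-based post-filter
    return enhanced

A toy estimator such as lambda mag: (mag > np.median(mag)).astype(float) is enough to exercise the loop end to end.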
A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation
TLDR
This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model
This paper addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels …
Blind Speech Separation and Enhancement With GCC-NMF
We present a blind source separation algorithm named GCC-NMF that combines unsupervised dictionary learning via non-negative matrix factorization (NMF) with spatial localization via the generalized cross-correlation (GCC) method …
Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment
TLDR
A blind source separation method for convolutive mixtures of speech/audio sources that can be applied to an underdetermined case where there are fewer microphones than sources is presented.
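A crude sketch of the two stages this summary names: per-frequency clustering of phase-normalized observation vectors, then a one-pass permutation alignment that matches each bin's mask activations to a global per-source activation prototype (real systems align more carefully, e.g. with iterative or neighbor-correlation schemes).

import numpy as np
from itertools import permutations
from scipy.cluster.vq import kmeans2

def binwise_separate(stfts, n_src=2):
    """Per-frequency clustering followed by a one-pass permutation
    alignment. stfts: (n_mics, n_freq, n_frames) complex STFTs.
    Returns binary masks of shape (n_src, n_freq, n_frames)."""
    M, F, T = stfts.shape
    masks = np.zeros((n_src, F, T))
    for f in range(F):
        v = stfts[:, f, :]                               # (M, T) observations
        v = v / (np.linalg.norm(v, axis=0, keepdims=True) + 1e-8)
        v = v * np.exp(-1j * np.angle(v[0]))             # phase-normalize to mic 0
        feats = np.concatenate([v.real, v.imag], axis=0).T
        _, labels = kmeans2(feats, n_src, minit='++', seed=0)
        for k in range(n_src):
            masks[k, f] = labels == k
    # Alignment: permute each bin's masks to best match a global
    # per-source activation prototype across time.
    proto = masks.mean(axis=1)                           # (n_src, n_frames)
    for f in range(F):
        best, best_p = -np.inf, None
        for p in permutations(range(n_src)):
            score = sum(masks[p[k], f] @ proto[k] for k in range(n_src))
            if score > best:
                best, best_p = score, p
        masks[:, f] = masks[list(best_p), f]
    return masks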
Speaker-Independent Speech Separation With Deep Attractor Network
TLDR
This work proposes a novel deep learning framework for speech separation that uses a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space, and proposes three methods for finding the attractors for each source in the embedding space, comparing their advantages and limitations.
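The attractor mechanism itself is a few lines of linear algebra: each attractor is the assignment-weighted centroid of the T-F embeddings, and masks follow from a softmax over embedding-attractor similarities. A sketch with oracle assignments, as used at training time:

import numpy as np

def danet_masks(embeddings, assignments):
    """Estimate masks from T-F embeddings via attractors.

    embeddings:  (n_tf, emb_dim) network outputs, one row per T-F unit.
    assignments: (n_tf, n_src) source-membership weights (oracle at
        training time; estimated, e.g. by k-means, at test time)."""
    # Attractor k = assignment-weighted centroid of the embeddings.
    attractors = (assignments.T @ embeddings) / (
        assignments.sum(axis=0)[:, None] + 1e-8)        # (n_src, emb_dim)
    sims = embeddings @ attractors.T                    # (n_tf, n_src)
    sims -= sims.max(axis=1, keepdims=True)             # numerically stable
    masks = np.exp(sims)
    return masks / masks.sum(axis=1, keepdims=True)     # softmax over sources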
...