Mask-Dependent Phase Estimation for Monaural Speaker Separation

  • Zhaoheng Ni, Michael I. Mandel
  • Published 7 November 2019
  • Physics
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Speaker separation refers to isolating speech of interest in a multi-talker environment. Most methods apply real-valued Time-Frequency (T-F) masks to the mixture Short-Time Fourier Transform (STFT) to reconstruct the clean speech. Hence there is an unavoidable mismatch between the phase of the reconstruction and the original phase of the clean speech. In this paper, we propose a simple yet effective phase estimation network that predicts the phase of the clean speech based on a T-F mask… 
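
The phase mismatch described above can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not from the paper: random complex arrays stand in for real STFTs of two talkers, the mask is a simple magnitude-ratio mask, and the helper names `apply_real_mask` and `wrapped_phase_error` are hypothetical. It only shows that a real-valued mask rescales magnitudes and leaves the mixture phase in place; it is not the proposed phase estimation network.

```python
import numpy as np


def apply_real_mask(mixture_stft, mask):
    """Scale each T-F bin of the complex mixture by a real-valued mask."""
    return mask * mixture_stft


def wrapped_phase_error(a, b):
    """Mean absolute phase difference between two complex arrays, in [0, pi]."""
    return float(np.mean(np.abs(np.angle(a * np.conj(b)))))


rng = np.random.default_rng(0)
shape = (257, 100)  # (frequency bins, frames); illustrative sizes only

# Assumption: random complex "spectrograms" replace real STFTs of two talkers.
s1 = rng.normal(size=shape) + 1j * rng.normal(size=shape)
s2 = rng.normal(size=shape) + 1j * rng.normal(size=shape)
mixture = s1 + s2

# Assumption: a magnitude-ratio mask with values in (0, 1).
mask = np.abs(s1) / (np.abs(s1) + np.abs(s2) + 1e-8)
estimate = apply_real_mask(mixture, mask)

# The masked estimate keeps the mixture phase, so it differs from the clean phase.
err_to_mixture = wrapped_phase_error(estimate, mixture)
err_to_clean = wrapped_phase_error(estimate, s1)
print("phase error vs. mixture:", err_to_mixture)
print("phase error vs. clean source:", err_to_clean)
```

In this sketch the phase error against the mixture is zero up to floating-point rounding, while the error against the clean source is not, which is the gap a phase estimator is meant to close.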


End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain
This paper combines the deep complex network (DCN) and Conv-TasNet to design an end-to-end complex-valued model, incorporates the short-time Fourier transform (STFT) and learnable complex layers to build a hybrid encoder-decoder structure, and uses a DCN-based separator.
Phase-aware subspace decomposition for single channel speech separation
The extensive evaluation under different test scenarios proves that PASD consistently improves the quality of the separated signals, compared to other benchmark approaches.
Joint Amplitude and Phase Refinement for Monaural Source Separation
The alternating direction method of multipliers (ADMM) is utilized to find time-domain signals whose amplitude spectrograms are close to the given ones in terms of the generalized alpha-beta divergences, and the effectiveness of the proposed method is confirmed through speech-nonspeech separation.
UltraSE: single-channel speech enhancement using ultrasound
This paper proposes UltraSE, which uses ultrasound sensing as a complementary modality to separate the desired speaker's voice from interferences and noise and introduces a multi-modal, multi-domain deep learning framework to fuse the ultrasonic Doppler features and the audible speech spectrogram.


End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its
Complex Ratio Masking for Monaural Speech Separation
The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective
This study investigates phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain and proposes three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference sign prediction.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks
A phase-sensitive objective function based on the signal-to-noise ratio (SNR) of the reconstructed signal is developed, and experiments show that it yields uniformly better results in terms of signal-to-distortion ratio (SDR).
Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation
The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives, with a modest model size.
Phasebook and Friends: Leveraging Discrete Representations for Source Separation
These methods are evaluated on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.
Single-Channel Multi-Speaker Separation Using Deep Clustering
This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Speech dereverberation and denoising using complex ratio masks
A deep neural network is used to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture and shows that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.