An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech
Binary time-frequency masking and model-based nonnegative matrix factorization (NMF) are two common approaches to speech separation. However, binary masking often yields poor perceptual quality, while NMF typically requires pretrained models for both speech and noise and frequently performs poorly. In this paper, we examine whether a single-stage or a two-stage approach should be used for separation. We propose a two-stage algorithm that applies a soft mask in the first stage to perform separation and NMF in the second stage to improve perceptual quality; only a speech model needs to be trained. We show that the proposed two-stage approach achieves higher objective perceptual quality and intelligibility than related single-stage methods.
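The two-stage idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, and everything specific in it is an assumption made for illustration: the soft mask is taken to be a Wiener-style ratio mask built from hypothetical speech and noise magnitude estimates (`speech_est`, `noise_est`), the speech model is a fixed nonnegative basis matrix `W_speech` assumed to have been trained beforehand, and the second stage fits only the activations using the standard Euclidean multiplicative update. The function names are likewise hypothetical.

```python
import numpy as np


def soft_mask(speech_est, noise_est, eps=1e-12):
    """Ratio mask in [0, 1] from magnitude estimates (assumed mask form)."""
    s2 = speech_est ** 2
    n2 = noise_est ** 2
    return s2 / (s2 + n2 + eps)


def nmf_activations(V, W, n_iter=100, eps=1e-12, seed=0):
    """Fit V ~= W @ H with W held fixed; only the activations H are updated.

    Uses the Euclidean (Frobenius) multiplicative update rule.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H


def two_stage(mixture_mag, speech_est, noise_est, W_speech, n_iter=100):
    """Stage 1: soft masking. Stage 2: NMF reconstruction from a speech basis."""
    # Stage 1: attenuate noise-dominated time-frequency units.
    stage1 = soft_mask(speech_est, noise_est) * mixture_mag
    # Stage 2: project the masked magnitudes onto the speech basis only,
    # so no noise model is needed (assumption: W_speech is pretrained).
    H = nmf_activations(stage1, W_speech, n_iter=n_iter)
    stage2 = W_speech @ H
    return stage1, stage2
```

All inputs are nonnegative magnitude spectrograms of shape (frequency bins, frames), and `W_speech` has shape (frequency bins, basis vectors). Because the second stage can only combine speech basis vectors, residual noise that the mask lets through is not reproduced unless the speech basis happens to explain it, which is the intuition behind training a speech model alone.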