A two-stage approach for improving the perceptual quality of separated speech


Binary time-frequency masking and model-based nonnegative matrix factorization (NMF) are two common approaches to speech separation. However, binary masking often suffers from poor perceptual quality, while NMF typically requires pretrained models for both speech and noise and often performs poorly. In this paper, we examine whether separation should be performed with a single-stage or a two-stage approach. We propose a two-stage algorithm that uses a soft mask in the first stage for separation and NMF in the second stage to improve perceptual quality; in the second stage, only a speech model needs to be trained. We show that the proposed two-stage approach achieves higher objective perceptual quality and intelligibility than related single-stage methods.
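The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the soft mask is estimated, which NMF cost function is used, or how the final signal is reconstructed. The sketch assumes a ratio-style soft mask built from hypothetical magnitude estimates `speech_est` and `noise_est`, a pretrained nonnegative speech basis `W_speech`, and Euclidean multiplicative updates for the activations. All function and parameter names are illustrative.

```python
import numpy as np


def soft_mask(speech_est, noise_est, eps=1e-12):
    """Ratio-style soft mask in [0, 1] (assumed form, not from the paper).

    speech_est, noise_est: nonnegative magnitude spectrogram estimates
    with shape (freq_bins, frames). How they are obtained is left open.
    """
    s2 = speech_est ** 2
    n2 = noise_est ** 2
    return s2 / (s2 + n2 + eps)


def nmf_activations(V, W, n_iter=100, eps=1e-12, seed=0):
    """Estimate H >= 0 such that V is approximately W @ H, with W held fixed.

    Uses the standard multiplicative update for the Euclidean cost.
    Only the activations are updated, so only a speech basis is needed.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def two_stage(mixture_mag, speech_est, noise_est, W_speech, n_iter=100):
    """Stage 1: apply a soft mask. Stage 2: project onto the speech basis."""
    stage1 = soft_mask(speech_est, noise_est) * mixture_mag
    H = nmf_activations(stage1, W_speech, n_iter=n_iter)
    stage2 = W_speech @ H  # reconstruction restricted to the speech model
    return stage1, stage2
```

In this sketch, the second stage can only produce spectra that lie in the cone spanned by the speech basis, which is one plausible way a speech-only model could clean up residual noise left by the mask. A complete system would also need an STFT front end and phase handling, both omitted here.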

DOI: 10.1109/ICASSP.2014.6854964

4 Figures and Tables
