Factorial models and refiltering for speech separation and denoising


This paper proposes the combination of several ideas, some old and some new, from machine learning and speech processing. We review the max approximation to log spectrograms of mixtures, show why this motivates a “refiltering” approach to separation and denoising, and then describe how the process of inference in factorial probabilistic models performs a computation useful for deriving the masking signals needed in refiltering. A particularly simple model, factorial-max vector quantization (MAXVQ), is introduced along with a branch-and-bound technique for efficient exact inference, and is applied to both denoising and monaural separation. Our approach represents a return to the ideas of Ephraim, Varga and Moore, but applied to auditory scene analysis rather than to speech recognition.

1. Sparsity & Redundancy in Spectrograms

1.1. The Log-Max Approximation

When two clean speech signals are mixed additively in the time domain, what is the relationship between the individual log spectrograms of the sources and the log spectrogram of the mixture? Unless the sources are highly dependent (synchronized), the spectrogram of the mixture is almost exactly the maximum of the individual spectrograms, with the maximum operating over small time-frequency regions (fig. 2). This observation, first noted by Roger Moore in 1983, holds because, unless e1 and e2 are both large and almost equal, log(e1 + e2) ≈ max(log e1, log e2) (fig. 1a).

The sparse nature of the speech code across time and frequency is the key to the practical usefulness of this approximation: most narrow frequency bands carry substantial energy only a small fraction of the time, and thus it is rare that two independent sources inject large amounts of energy into the same subband at the same time. (Figure 1b shows a plot of the relative energy of two simultaneous speakers in a narrow subband; most of the time at least one of the two sources shows negligible power.)

1.2. Masking and Refiltering

Fortunately, the speech code is also redundant across time-frequency. Different frequency bands carry, to a certain extent, independent information, and so if information in some bands is suppressed or masked, even for significant durations, other bands can fill in. (A similar effect occurs over time: if brief sections of the signal are obscured, even across all bands, the speech is still intelligible; while also useful, we do not exploit this here.) This is partly why humans perform so well on many monaural speech separation and denoising tasks.

When we solve the cocktail party problem or recognize degraded speech, we are doing structural analysis, or a kind of “perceptual grouping”, on the incoming sound. There is substantial evidence that the appropriate subparts of an audio signal for use in grouping may be narrow frequency bands over short times. To generate these parts computationally, we can perform multiband analysis: break the original speech signal y(t) into many subband signals b_i(t), each ...
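
The log-max approximation of section 1.1 can be checked numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and it assumes that the energies of the two sources simply add within a time-frequency cell (that is, cross terms between the sources are ignored). Under that assumption the approximation error equals log(1 + min(e1, e2) / max(e1, e2)), so it can never exceed log 2, and it shrinks quickly as one source comes to dominate the cell.

```python
import math


def log_sum(e1, e2):
    """Log energy of the mixture, ASSUMING the two source energies add."""
    return math.log(e1 + e2)


def log_max(e1, e2):
    """The max approximation: the larger of the two log energies."""
    return max(math.log(e1), math.log(e2))


def approximation_error(e1, e2):
    """Gap between the assumed mixture log energy and the max approximation.

    Algebraically this is log(1 + min/max), so it lies in (0, log 2].
    """
    return log_sum(e1, e2) - log_max(e1, e2)


if __name__ == "__main__":
    # Hold source 1 at unit energy and place source 2 a given number of
    # decibels below it (illustrative values, not data from the paper).
    for gap_db in (0, 3, 10, 20, 40):
        e1 = 1.0
        e2 = 10 ** (-gap_db / 10)
        err = approximation_error(e1, e2)
        print(f"{gap_db:>2} dB apart: error = {err:.5f} nats")
```

The worst case occurs when the two energies are equal, which is the "both large and almost equal" condition named in the text; the sparsity argument explains why that case is rare for independent speakers.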


Cite this paper

@inproceedings{Roweis2003FactorialMA,
  title={Factorial models and refiltering for speech separation and denoising},
  author={Sam T. Roweis},
  booktitle={INTERSPEECH},
  year={2003}
}