One Microphone Source Separation

Abstract

Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering.

1 Learning from data in computational auditory scene analysis

Imagine listening to many pianos being played simultaneously. If each pianist were striking keys randomly it would be very difficult to tell which note came from which piano. But if each were playing a coherent song, separation would be much easier because of the structure of music. Now imagine teaching a computer to do the separation by showing it many musical scores as "training data".

Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. Bregman [1] called this process auditory scene analysis in analogy to vision. Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings and is very similar to the musical task described above. It has been investigated by research groups with different emphases. The CASA community have focused on both multiple and single microphone source separation problems under highly realistic acoustic conditions, but have used almost exclusively hand-designed systems which include substantial knowledge of the human auditory system and its psychophysical characteristics (e.g. [2,3]). Unfortunately, it is difficult to incorporate large amounts of detailed statistical knowledge about the problem into such an approach. On the other hand, machine learning researchers, especially those working on independent components analysis (ICA) and related algorithms, have focused on the case of multiple microphones in simplified mixing environments and have used powerful "blind" statistical techniques. These "unmixing" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. Furthermore, since they often depend only on the joint amplitude histogram of the observations, they can be very sensitive to the details of filtering and reverberation in the environment.

The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording.

2 Refiltering vs. unmixing
Unmixing algorithms reweight multiple simultaneous recordings m_k(t) (generically called microphones) to form a new source object s(t):

    s(t) = \alpha_1 m_1(t) + \alpha_2 m_2(t) + \cdots + \alpha_K m_K(t)    (1)

The unmixing coefficients \alpha_k are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. The intuition is that unmixing algorithms are finding spikes (or dents, for low kurtosis sources) in the marginal amplitude histogram. The time ordering of the datapoints is often irrelevant. Unmixing depends on a fine timescale, sample-by-sample comparison of several observation signals. Humans, on the other hand, cannot hear histogram spikes (footnote 1) and perform well on many monaural separation tasks. We are doing structural analysis, or a kind of perceptual grouping, on the incoming sound.

But what is being grouped? There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. To generate these parts, one can perform multiband analysis: break the original signal y(t) into many subband signals b_i(t), each filtered to contain only energy from a small portion of the spectrum. The results of such an analysis are often displayed as a spectrogram, which shows energy (using colour or grey level) as a function of time (abscissa) and frequency (ordinate). (For example, one is shown on the top left of figure 5.) In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key.

The basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals b_i(t). Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. Given a set of masking signals, denoted \alpha_i(t), a source s(t) can be recovered by modulating the corresponding subband signals from the original input and summing:

    s(t) = \alpha_1(t) b_1(t) + \alpha_2(t) b_2(t) + \cdots + \alpha_K(t) b_K(t)    (2)

The \alpha_i(t) are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. This performs masking on the original spectrogram. (An equivalent operation can be performed in the frequency domain; see footnote 2.) This approach, illustrated in figure 1, forms the basis of many CASA approaches (e.g. [2,3,4]). For any specific choice of masking signals \alpha_i(t), refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. Different sources can be isolated by choosing different masking signals. Henceforth, I will make a strong simplifying assumption that the \alpha_i(t) are binary and constant over a timescale of roughly 30 ms. This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. However, in practice, for small numbers of sources, this approximation works quite well (figure 3). (Think of ignoring collisions by assuming separate piano players do not often hit the same note at the same time.)
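As a concrete illustration of equation (2), the following is a minimal refiltering sketch in Python, assuming a small Butterworth filterbank (via scipy) and binary masking signals that are constant over roughly 30 ms frames. The band edges, sampling rate, frame length and the random mask are illustrative assumptions, not values from the paper; in the actual system the mask would come from the learned model.

# Minimal refiltering sketch (equation 2): split y(t) into subbands b_i(t),
# then recombine them with frame-constant binary masks alpha_i(t).
import numpy as np
from scipy.signal import butter, lfilter

def subband_signals(y, fs, band_edges):
    """Split y(t) into subband signals b_i(t) with bandpass filters."""
    bands = []
    for lo, hi in band_edges:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        bands.append(lfilter(b, a, y))
    return np.stack(bands)                      # shape: (num_bands, num_samples)

def refilter(bands, mask, frame_len):
    """Recombine subbands with masking signals alpha_i(t) held constant
    over frames of ~30 ms, as in equation (2)."""
    num_bands, n = bands.shape
    alpha = np.repeat(mask, frame_len, axis=1)[:, :n]   # expand frame mask to samples
    return np.sum(alpha * bands, axis=0)                # s(t) = sum_i alpha_i(t) b_i(t)

# Illustrative usage with random data standing in for a real mixture.
fs = 8000
y = np.random.randn(fs)                                  # 1 second of "audio"
edges = [(100, 500), (500, 1000), (1000, 2000), (2000, 3500)]  # hypothetical bands
bands = subband_signals(y, fs, edges)
frame_len = int(0.030 * fs)                              # ~30 ms frames
num_frames = int(np.ceil(len(y) / frame_len))
mask = np.random.randint(0, 2, size=(len(edges), num_frames))  # binary alpha_i
s_hat = refilter(bands, mask, frame_len)

Choosing different masks isolates different sources from the same set of subband signals, which is exactly the degree of freedom the learning algorithm must supply.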
Footnote 1: Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back.

Footnote 2: Make a conventional spectrogram of the original signal y(t) and modulate the magnitude of each short-time DFT while preserving its phase:

    s_w(t) = \mathcal{F}^{-1}\{ \alpha_{wi} \, \|\mathcal{F}\{y_w(t)\}\| \, \angle\mathcal{F}\{y_w(t)\} \}

where s_w(t) and y_w(t) are the w-th windows (blocks) of the recovered and original signals, \alpha_{wi} is the masking signal for subband i in window w, and \mathcal{F}[\cdot] is the DFT.
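The frequency-domain variant described in footnote 2 can be sketched as follows, assuming scipy's STFT/ISTFT with a Hann window; the window length, overlap, and the random binary mask standing in for \alpha_{wi} are assumptions for illustration only.

# Frequency-domain masking sketch (footnote 2): scale the magnitude of each
# short-time DFT by alpha_{wi} while keeping the original phase, then invert.
import numpy as np
from scipy.signal import stft, istft

def refilter_stft(y, fs, mask, nperseg=256, noverlap=128):
    """Apply masking signals alpha_{wi} to the STFT magnitude of y(t)."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    magnitude, phase = np.abs(Y), np.angle(Y)        # |F{y_w}| and angle F{y_w}
    S = mask * magnitude * np.exp(1j * phase)        # alpha_{wi} |F{y_w}| at original phase
    _, s_hat = istft(S, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat

# Illustrative usage: build a random binary mask matching the STFT shape.
fs = 8000
y = np.random.randn(fs)
_, _, Y = stft(y, fs=fs, nperseg=256, noverlap=128)
mask = np.random.randint(0, 2, size=Y.shape)         # binary alpha_{wi}
s_hat = refilter_stft(y, fs, mask)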
