A Style Transfer Approach to Source Separation

  • Shrikant Venkataramani, Efthymios Tzinis, Paris Smaragdis
  • 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
Training neural networks for source separation involves presenting a mixture recording at the input of the network and updating network parameters in order to produce an output that resembles the clean source. Consequently, supervised source separation depends on the availability of paired mixture-clean training examples. In this paper, we interpret source separation as a style transfer problem. We present a variational auto-encoder network that exploits the commonality across the domain of… 
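The shared-latent-space idea can be sketched in a few lines: two domain-specific encoders (one for mixtures, one for clean sources) map into a common latent space, and a single shared decoder only ever emits clean-domain output. Everything below is an illustrative assumption, not the authors' implementation: the encoders/decoder are untrained random linear maps and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

F, D = 257, 32          # spectrogram frame size, latent dimension (assumed values)

# Separate encoders for the mixture and clean domains; one shared decoder.
# Weights are random here -- this only illustrates the data flow, not training.
W_enc_mix = rng.standard_normal((2 * D, F)) * 0.01    # outputs [mu; log_var]
W_enc_clean = rng.standard_normal((2 * D, F)) * 0.01
W_dec = rng.standard_normal((F, D)) * 0.01            # shared, clean-domain decoder

def encode(W, frame):
    h = W @ frame
    return h[:D], h[D:]                 # mu, log_var

def reparameterize(mu, log_var):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(frame, recon, mu, log_var):
    recon_err = np.mean((frame - recon) ** 2)
    kl = -0.5 * np.mean(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon_err + kl

# A clean frame is autoencoded through its own encoder ...
clean = np.abs(rng.standard_normal(F))
mu_c, lv_c = encode(W_enc_clean, clean)
recon_clean = W_dec @ reparameterize(mu_c, lv_c)

# ... while a mixture frame is encoded into the SAME latent space and decoded
# by the SAME decoder, so decoding a mixture yields a clean-domain frame.
mixture = clean + np.abs(rng.standard_normal(F))
mu_m, lv_m = encode(W_enc_mix, mixture)
separated = W_dec @ reparameterize(mu_m, lv_m)

loss = vae_loss(clean, recon_clean, mu_c, lv_c)
print(separated.shape)
```

Training would alternate the usual VAE objective on each domain while tying the two posteriors together; the key point the sketch shows is that separation becomes decoding a mixture's latent code through the clean-domain decoder.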


Self-supervised Learning for Speech Enhancement
This work uses a limited training set of clean speech sounds and autoencodes speech mixtures recorded in noisy environments, training the resulting autoencoder to share a latent representation with the clean examples; it shows that the network can map noisy speech to its clean version and is trainable without labeled training examples or human intervention.
Content Based Singing Voice Extraction from a Musical Mixture
A deep learning based methodology for extracting the singing voice signal from a musical mixture, based on the underlying linguistic content, that can extract the unprocessed raw vocal signal even from a processed mixture dataset with singers not seen during training.
Impact of Minimum Hyperspherical Energy Regularization on Time-Frequency Domain Networks for Singing Voice Separation
  • Neil Shah, Dharmeshkumar Agrawal
  • Computer Science
    2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2020
This work proposes Gammatone auditory features for the time-frequency (T-F) mask-based singing voice separation task, experimentally shows that MHE-regularized T-F domain networks fail relative to their unregularized versions, and argues for the need to design a suitable adversarial objective function.


Monoaural Audio Source Separation Using Variational Autoencoders
A principled generative approach to audio source separation using variational autoencoders (VAEs) with a latent generative model; the proposed framework yields reasonable improvements over baseline methods available in the literature.
Unsupervised Deep Clustering for Source Separation: Direct Learning from Mixtures Using Spatial Information
A deep clustering approach is used which trains on multichannel mixtures and learns to project spectrogram bins to source clusters that correlate with various spatial features, and shows that this system is capable of performing sound separation on monophonic inputs, despite having learned how to do so using multi-channel recordings.
Deep clustering and conventional networks for music separation: Stronger together
It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
Bootstrapping Single-channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures
The idea is to use simple, low-level processing to separate sources in an unsupervised fashion, identify easy conditions, and then use that knowledge to bootstrap a (self-)supervised source separation model for difficult conditions.
Deep clustering: Discriminative embeddings for segmentation and separation
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and that the same model does surprisingly well on three-speaker mixtures.
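The deep clustering recipe (embed each T-F bin of the spectrogram, then cluster the embeddings so each cluster becomes a per-source binary mask) can be illustrated with a toy example. Everything below is an assumption for illustration, not the paper's network: the embeddings are synthetic two-cluster data rather than a learned BiLSTM output, and the k-means routine is a minimal stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "embeddings": each T-F bin gets a unit-norm vector; bins dominated by
# the same source should embed close together. Here we fake two separated
# clusters instead of running a trained network.
T, F, D = 10, 8, 4                       # frames, freq bins, embed dim (assumed)
labels_true = rng.integers(0, 2, size=T * F)
centers = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
emb = centers[labels_true] + 0.05 * rng.standard_normal((T * F, D))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def kmeans(x, k, iters=20):
    """Minimal Lloyd's algorithm; real systems use a library implementation."""
    cents = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - cents[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                cents[j] = x[assign == j].mean(0)
    return assign

assign = kmeans(emb, k=2)
# Each cluster of T-F bins becomes a binary mask for one source.
masks = [(assign == j).reshape(T, F).astype(float) for j in range(2)]
print(masks[0].shape)   # the two masks partition all T*F bins
```

At separation time each mask is applied to the mixture spectrogram; because the masks come from clustering, the number of speakers only enters through the choice of k.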
A Universal Music Translation Network
This method is based on a multi-domain wavenet autoencoder, with a shared encoder and a disentangled latent space that is trained end-to-end on waveforms, allowing it to translate even from musical domains that were not seen during training.
Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation
We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised…
End-To-End Source Separation With Adaptive Front-Ends
An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the network's ability to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
  • Scott Wisdom, J. Hershey, R. Saurous
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This paper presents a new approach to masking that applies mixture consistency to complex-valued short-time Fourier transforms (STFTs) using real-valued masks, and shows that this approach can be effective in speech enhancement.
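The simplest (uniform-weight) form of mixture consistency is easy to sketch: after masking, distribute the residual between the mixture STFT and the summed source estimates equally across the sources, so the projected estimates sum exactly to the mixture. The dimensions and random data below are arbitrary placeholders, and this shows only the uniform variant of the constraint.

```python
import numpy as np

rng = np.random.default_rng(2)

# Complex STFT of the mixture, and two masked source estimates that do not
# (yet) sum to the mixture -- masking generally breaks this property.
x = rng.standard_normal((5, 9)) + 1j * rng.standard_normal((5, 9))
est = rng.standard_normal((2, 5, 9)) + 1j * rng.standard_normal((2, 5, 9))

def mixture_consistency(estimates, mixture):
    """Uniform mixture-consistency projection: spread the residual between
    the mixture and the summed estimates equally over all sources."""
    residual = mixture - estimates.sum(axis=0)
    return estimates + residual / estimates.shape[0]

consistent = mixture_consistency(est, x)
print(np.max(np.abs(consistent.sum(axis=0) - x)))  # ~0: estimates now sum to x
```

Because the projection is just additions and a division, it is differentiable and can sit inside the network as a final layer rather than being applied as post-processing.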
Semi-supervised Monaural Singing Voice Separation with a Masking Network Trained on Synthetic Mixtures
This work studies semi-supervised singing voice separation, in which the training data contains a set of mixed-music samples (singing plus instrumental) and an unmatched set of instrumental music; it employs a single mapping function g that recovers the underlying instrumental music from a mixture and, when applied to an instrumental sample, returns the same sample.