A Style Transfer Approach to Source Separation
@article{Venkataramani2019AST,
  title={A Style Transfer Approach to Source Separation},
  author={Shrikant Venkataramani and Efthymios Tzinis and Paris Smaragdis},
  journal={2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2019},
  pages={170-174}
}
Training a neural network for source separation involves presenting a mixture recording at the network's input and updating the network's parameters so that the output resembles the clean source. Consequently, supervised source separation depends on the availability of paired mixture-clean training examples. In this paper, we interpret source separation as a style transfer problem. We present a variational auto-encoder network that exploits the commonality across the domain of…
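The core idea is a variational auto-encoder whose latent representation is shared between the mixture and clean-source domains. Below is a minimal, hypothetical PyTorch sketch of such a shared-latent VAE; the architecture, layer sizes, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a shared-latent VAE for "style transfer" separation.
# Assumes magnitude-spectrogram frames as inputs; all sizes are illustrative.
import torch
import torch.nn as nn

class SharedLatentVAE(nn.Module):
    def __init__(self, n_bins=513, n_latent=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
        self.mu = nn.Linear(256, n_latent)        # posterior mean
        self.logvar = nn.Linear(256, n_latent)    # posterior log-variance
        self.decoder = nn.Sequential(             # decodes into the *clean* domain
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_bins), nn.Softplus())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar

def vae_loss(x_target, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x_target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Training idea: autoencode clean frames (target = input) and decode mixture
# frames through the same latent space into the clean domain, so mixtures are
# "translated" to clean sources much like a style transfer.
```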
3 Citations
Self-supervised Learning for Speech Enhancement
- Computer Science · ArXiv
- 2020
This work uses a limited training set of clean speech sounds and autoencodes speech mixtures recorded in noisy environments, training the resulting autoencoder to share a latent representation with the clean examples. It shows that noisy speech can be mapped to its clean version by a network that is trainable autonomously, without requiring labeled training examples or human intervention.
Content Based Singing Voice Extraction from a Musical Mixture
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A deep learning based methodology for extracting the singing voice signal from a musical mixture based on its underlying linguistic content; the method is able to extract the unprocessed raw vocal signal from the mixture, even for a processed mixture dataset with singers not seen during training.
Impact of Minimum Hyperspherical Energy Regularization on Time-Frequency Domain Networks for Singing Voice Separation
- Computer Science · 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2020
This work proposes Gammatone auditory features for the Time-Frequency (T-F) mask-based singing voice separation task. It experimentally shows that MHE-regularized T-F domain networks fail relative to their unregularized versions, and argues for the need to design a suitable adversarial objective function.
References
Showing 1-10 of 35 references
Monoaural Audio Source Separation Using Variational Autoencoders
- Computer Science · INTERSPEECH
- 2018
A principled generative approach to audio source separation using variational autoencoders (VAEs) with a latent generative model; the proposed framework yields reasonable improvements over baseline methods from the literature.
Unsupervised Deep Clustering for Source Separation: Direct Learning from Mixtures Using Spatial Information
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A deep clustering approach that trains on multichannel mixtures and learns to project spectrogram bins onto source clusters that correlate with various spatial features; the resulting system can perform sound separation on monophonic inputs, despite having learned to do so from multichannel recordings.
Deep clustering and conventional networks for music separation: Stronger together
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
Bootstrapping Single-channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The idea is to use simple, low-level processing to separate sources in an unsupervised fashion, identify easy conditions, and then use that knowledge to bootstrap a (self-)supervised source separation model for difficult conditions.
Deep clustering: Discriminative embeddings for segmentation and separation
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and that the same model does surprisingly well on three-speaker mixtures (the underlying affinity objective is sketched below).
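For reference, the deep clustering objective minimizes the Frobenius distance between the embedding and label affinity matrices, ||VVᵀ − YYᵀ||²_F. A minimal PyTorch sketch, using the standard expansion that avoids forming the N × N affinities (shapes are illustrative):

```python
# Deep clustering loss: ||V V^T - Y Y^T||_F^2, expanded so the N x N
# affinity matrices are never formed (N = number of time-frequency bins).
import torch

def deep_clustering_loss(V: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """V: (N, D) unit-norm embeddings; Y: (N, C) one-hot source assignments."""
    fro2 = lambda A: (A * A).sum()          # squared Frobenius norm
    return fro2(V.t() @ V) - 2 * fro2(V.t() @ Y) + fro2(Y.t() @ Y)
```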
A Universal Music Translation Network
- Computer Science · ICLR
- 2019
This method is based on a multi-domain WaveNet autoencoder with a shared encoder and a disentangled latent space, trained end-to-end on waveforms, allowing it to translate even from musical domains that were not seen during training.
Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised…
End-To-End Source Separation With Adaptive Front-Ends
- Computer Science · 2018 52nd Asilomar Conference on Signals, Systems, and Computers
- 2018
An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms; the network is shown to learn optimal, real-valued basis functions directly from the raw waveform of a signal (a hypothetical sketch follows).
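A hypothetical sketch of such an adaptive front end, using a 1-D convolutional encoder/decoder whose filters play the role of learned, real-valued basis functions; window, hop, and basis counts here are assumed values, not the paper's settings:

```python
# A learned, STFT-like front end: 1-D convolutional analysis/synthesis filters
# act as real-valued basis functions fit directly to raw waveforms.
import torch
import torch.nn as nn

class AdaptiveFrontEnd(nn.Module):
    def __init__(self, n_basis=512, win=1024, hop=256):
        super().__init__()
        self.analysis = nn.Conv1d(1, n_basis, win, stride=hop, bias=False)
        self.synthesis = nn.ConvTranspose1d(n_basis, 1, win, stride=hop, bias=False)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        coeffs = torch.relu(self.analysis(wav))    # nonnegative latent "spectrogram"
        return self.synthesis(coeffs)              # waveform reconstruction
```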
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper presents a new approach to masking that applies mixture consistency to complex-valued short-time Fourier transforms (STFTs) using real-valued masks, and shows that this approach can be effective for speech enhancement (the uniform projection is sketched below).
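The mixture-consistency idea can be illustrated with the uniformly weighted projection, which redistributes whatever part of the mixture the estimates fail to explain; this is a sketch of the constraint only, not the paper's complete method:

```python
# Uniformly weighted mixture-consistency projection: spread the unexplained
# residual evenly so the adjusted estimates sum exactly to the mixture.
import torch

def mixture_consistency(est: torch.Tensor, mix: torch.Tensor) -> torch.Tensor:
    """est: (n_src, ...) source estimates; mix: (...) the mixture signal."""
    residual = mix - est.sum(dim=0)        # what the estimates fail to explain
    return est + residual / est.shape[0]   # distribute it uniformly
```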
Semi-supervised Monaural Singing Voice Separation with a Masking Network Trained on Synthetic Mixtures
- Mathematics · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work studies semi-supervised singing voice separation, in which the training data contains a set of mixed-music samples (singing plus instrumental) and an unmatched set of instrumental music. It employs a single mapping function g that recovers the underlying instrumental music from a mixture and, when applied to an instrumental sample, returns the sample unchanged (a loose sketch of the resulting two-term loss follows).
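The two constraints on g described above can be written as a two-term loss. The sketch below is a loose illustration; in particular, building the synthetic mixture from an unmatched instrumental and a detached vocal estimate is an assumption here, not the paper's exact recipe:

```python
# Two-term objective for a single mapping g (hypothetical construction):
# (i) g acts as the identity on instrumental-only audio, and
# (ii) g recovers the instrumental part of a synthetic mixture.
import torch

def semi_supervised_loss(g, instrumental, vocal_estimate):
    # (i) identity constraint on unmatched instrumental samples
    l_identity = torch.mean((g(instrumental) - instrumental) ** 2)
    # (ii) separation constraint on a synthetic mixture; pairing an
    # instrumental with a detached vocal estimate is an assumption here
    synth_mix = instrumental + vocal_estimate.detach()
    l_separation = torch.mean((g(synth_mix) - instrumental) ** 2)
    return l_identity + l_separation
```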