Monoaural Audio Source Separation Using Variational Autoencoders

  Laxmi Pandey, Anurendra Kumar, Vinay Namboodiri
We introduce a monaural audio source separation framework using a latent generative model. It contains a probabilistic encoder, which projects input data to a latent space, and a probabilistic decoder, which projects data from the latent space back to the input space. This allows us to learn a robust latent representation of sources corrupted with noise and other sources. The latent representation is then fed to the decoder to yield the separated source. Both encoder and decoder are implemented via…
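The encoder/decoder pipeline described in the abstract can be sketched in a few lines. This is a minimal, illustrative forward pass only (no training), with hypothetical dimensions and randomly initialised weights standing in for trained parameters; the actual network layers and sizes are not specified in the excerpt above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 129-bin magnitude-spectrogram frame, 16-dim latent space.
input_dim, latent_dim = 129, 16

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.standard_normal((input_dim, 2 * latent_dim)) * 0.01
W_dec = rng.standard_normal((latent_dim, input_dim)) * 0.01

def encode(x):
    """Probabilistic encoder: map a frame to a Gaussian over latent space."""
    h = x @ W_enc
    mu, log_var = h[:latent_dim], h[latent_dim:]
    return mu, log_var

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps, the standard VAE reparameterization."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Probabilistic decoder: map a latent sample back to the input space."""
    return np.maximum(z @ W_dec, 0.0)  # clamp: magnitudes are non-negative

mixture_frame = np.abs(rng.standard_normal(input_dim))
mu, log_var = encode(mixture_frame)
separated_frame = decode(reparameterize(mu, log_var))
print(separated_frame.shape)  # (129,)
```

In the paper's setting, training the encoder on corrupted inputs is what makes the latent representation robust; decoding that representation then yields the separated source.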


Audio Source Separation Using Variational Autoencoders and Weak Class Supervision
This letter proposes a source separation method that is trained by observing mixtures and the class labels of the sources present in them, without any access to isolated sources, and shows that the separation performance obtained is as good as that obtained with source-signal supervision.
Joint Source Separation and Classification Using Variational Autoencoders
In this paper, we propose a novel multi-task variational autoencoder (VAE) based approach for joint source separation and classification. The network uses a probabilistic encoder for each source to…
A Style Transfer Approach to Source Separation
A variational auto-encoder network is presented that exploits the commonality between the domain of mixtures and the domain of clean sounds, learns a shared latent representation across the two domains, and performs source separation without explicit supervision from paired training examples.
Weak Label Supervision for Monaural Source Separation Using Non-negative Denoising Variational Autoencoders
This paper proposes a weak supervision method that only uses class information rather than source signals for learning to separate short utterance mixtures, and demonstrates that the separation results are on par with source signal supervision.
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
This paper shows that a solid theoretical statistical framework, extensively presented and discussed in work on nonnegative matrix factorization of audio spectrograms and its application to audio source separation, carries over to this setting, and provides insights on the choice and interpretability of the data representation and model parameterization.
A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling
The results of an experimental benchmark comparing six of the DVAE models on the speech analysis-resynthesis task are presented, as an illustration of the high potential of DVAEs for speech modeling.
Latent Representation Learning for Artificial Bandwidth Extension Using a Conditional Variational Auto-encoder
This paper reports the first application of conditional variational auto-encoders (CVAEs) for supervised dimensionality reduction specifically tailored to ABE, and shows that the probabilistic latent representations learned with CVAEs produce bandwidth-extended speech signals of notably better quality.
Speech Enhancement with Variance Constrained Autoencoders
This work proposes using the Variance Constrained Autoencoder (VCAE) for speech enhancement and demonstrates experimentally that the proposed enhancement model outperforms SE-GAN and SE-WaveNet in terms of perceptual quality of enhanced signals.
CASS: Cross Adversarial Source Separation via Autoencoder
A new model is introduced that separates an input signal consisting of a mixture of multiple components into its individual components, defined via adversarial learning and autoencoder fitting; it achieves state-of-the-art performance, especially when target components share similar data structures.
EnerGAN++: A Generative Adversarial Gated Recurrent Network for Robust Energy Disaggregation
The proposed EnerGAN++ is a Generative Adversarial Network (GAN) based model for robust energy disaggregation that unifies the autoencoder (AE) and GAN architectures into a single framework, leveraging the ability of Convolutional Neural Networks (CNNs) for rapid processing and effective feature extraction while also modeling the data's temporal character and time dependencies.


Single channel audio source separation using convolutional denoising autoencoders
This work proposes deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation and shows that CDAEs separate sources slightly better than deep feedforward neural networks (FNNs), even with fewer parameters.
Multichannel Audio Source Separation With Deep Neural Networks
This article proposes a framework where deep neural networks are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information and presents its application to a speech enhancement problem.
Adaptive Denoising Autoencoders: A Fine-Tuning Scheme to Learn from Test Mixtures
Experimental results on audio source separation tasks demonstrate that the proposed fine-tuning technique can further improve the sound quality of a DAE during the test procedure.
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
Joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including speech separation, singing voice separation, and speech denoising, and a discriminative criterion for training neural networks to further enhance the separation performance are explored.
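The masking functions referred to above can be illustrated with a toy soft time-frequency mask. In this sketch (illustrative only; the paper's masks are produced by a trained recurrent network), two hypothetical per-source magnitude estimates are turned into complementary masks, and applying them to the mixture enforces the reconstruction constraint that the masked estimates sum back to the mixture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical network outputs: magnitude estimates for two sources
# over a toy spectrogram of 5 frames x 8 frequency bins.
y1 = np.abs(rng.standard_normal((5, 8)))
y2 = np.abs(rng.standard_normal((5, 8)))
mixture = np.abs(rng.standard_normal((5, 8)))

# Soft time-frequency masks derived from the two estimates.
eps = 1e-8
mask1 = y1 / (y1 + y2 + eps)
mask2 = 1.0 - mask1

# Masked estimates: by construction they sum back to the mixture,
# which is the reconstruction constraint the masking layer enforces.
s1_hat = mask1 * mixture
s2_hat = mask2 * mixture

print(np.allclose(s1_hat + s2_hat, mixture))  # True
```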
Deep learning for monaural speech separation
The joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint, is proposed to enhance the separation performance of monaural speech separation models.
Discriminative NMF and its application to single-channel source separation
Results on the 2nd CHiME Speech Separation and Recognition Challenge task indicate significant gains in source-to-distortion ratio with respect to sparseNMF, exemplar-based NMF, as well as a previously proposed discriminative NMF criterion.
Extracting and composing robust features with denoising autoencoders
This work introduces and motivates a new training principle for unsupervised learning of a representation, based on the idea of making the learned representations robust to partial corruption of the input pattern.
Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria
  T. Virtanen · IEEE Transactions on Audio, Speech, and Language Processing · 2007
An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented; it achieves better separation quality than previous algorithms.
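The NMF core underlying this line of work factorizes a non-negative spectrogram V into basis spectra W and activations H. The sketch below uses the standard Lee-Seung multiplicative updates for the Euclidean cost ||V − WH||²; Virtanen's method augments this objective with temporal-continuity and sparseness terms, which are omitted here for brevity. All data and dimensions are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy non-negative "spectrogram": 20 frequency bins x 30 frames.
V = np.abs(rng.standard_normal((20, 30)))
rank, eps = 4, 1e-9

# Non-negative basis spectra W (20 x rank) and activations H (rank x 30).
W = np.abs(rng.standard_normal((20, rank)))
H = np.abs(rng.standard_normal((rank, 30)))

# Multiplicative updates keep W and H non-negative and monotonically
# decrease the Euclidean reconstruction cost.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Relative reconstruction error after factorization.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 1.0)
```

In a separation setting, columns of W are grouped by source and each source is reconstructed from its own bases and activations.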
Speech enhancement based on deep denoising autoencoder
Experimental results show that increasing the depth of the DAE consistently improves performance when a large training set is available; compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provides superior performance on all three objective evaluations.
A Fully Convolutional Neural Network for Speech Enhancement
The proposed network, Redundant Convolutional Encoder Decoder (R-CED), demonstrates that a convolutional network can be 12 times smaller than a recurrent network and yet achieves better performance, which shows its applicability for an embedded system: the hearing aids.