A Bayesian Permutation Training Deep Representation Learning Method for Speech Enhancement with Variational Autoencoder

@inproceedings{Xiang2022ABP,
  title={A Bayesian Permutation Training Deep Representation Learning Method for Speech Enhancement with Variational Autoencoder},
  author={Yang Xiang and Jesper Lisby H{\o}jvang and Morten H{\o}jfeldt Rasmussen and Mads Gr{\ae}sb{\o}ll Christensen},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022}
}
Recently, the variational autoencoder (VAE), a deep representation learning (DRL) model, has been used to perform speech enhancement (SE). However, to the best of our knowledge, current VAE-based SE methods apply the VAE only to model the speech signal, while noise is modeled using the traditional non-negative matrix factorization (NMF) model. One of the most important reasons for using NMF is that these VAE-based methods cannot disentangle the speech and noise latent variables from the observed signal… 
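The NMF noise model mentioned in the abstract factorizes a non-negative noise spectrogram V into a spectral basis W and activations H. A minimal sketch using the standard Lee–Seung multiplicative updates for the Euclidean cost (the matrix sizes and random data here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((64, 100))            # stand-in magnitude spectrogram (freq x frames)
K = 8                                # number of spectral basis vectors
W = rng.random((64, K)) + 1e-3       # non-negative spectral basis
H = rng.random((K, 100)) + 1e-3      # non-negative activations

err_before = np.linalg.norm(V - W @ H)
for _ in range(50):
    # Lee-Seung multiplicative updates; non-negativity is preserved
    # because every factor in each update is non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
err_after = np.linalg.norm(V - W @ H)
```

Each update monotonically lowers the Euclidean reconstruction error, which is why `err_after` ends up below `err_before` after the loop.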


A deep representation learning speech enhancement method using β-VAE
TLDR
The proposed β-VAE strategy can be used to optimize the DNN’s structure and acquire better speech and noise latent representations than PVAE, and obtains a higher scale-invariant signal-to-distortion ratio, speech quality, and speech intelligibility.
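The scale-invariant signal-to-distortion ratio (SI-SDR) reported in the TLDR above first projects the estimate onto the reference, so rescaling the estimate does not change the score. A minimal NumPy sketch (the test signals are illustrative, not from the paper):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference        # part of the estimate explained by the reference
    residual = estimate - target      # everything else counts as distortion
    return 10.0 * np.log10(np.sum(target**2) / np.sum(residual**2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                      # stand-in clean waveform
noisy = clean + 0.3 * rng.standard_normal(16000)        # noisy estimate of it
```

Because of the projection step, `si_sdr(2.0 * noisy, clean)` equals `si_sdr(noisy, clean)`, which is the "scale-invariant" property.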

References

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
TLDR
A Monte Carlo expectation-maximization algorithm is developed for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters; experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach.
Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder
TLDR
It is shown that the proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters, and is capable of generalizing to unseen noise conditions better than a supervised feedforward deep neural network (DNN).
Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization
TLDR
This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech that outperformed the conventional DNN-based method in unseen noisy environments.
Guided Variational Autoencoder for Speech Enhancement with a Supervised Classifier
TLDR
Provided that the label better informs the latent distribution and that the classifier achieves good performance, the proposed approach outperforms the standard variational autoencoder and a conventional neural network-based supervised approach.
Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation
TLDR
This paper addresses the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech.
A Parallel-Data-Free Speech Enhancement Method Using Multi-Objective Learning Cycle-Consistent Generative Adversarial Network
  • Yang Xiang, C. Bao
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
TLDR
A novel parallel-data-free speech enhancement method in which a cycle-consistent generative adversarial network (CycleGAN) and multi-objective learning are employed; it effectively improves speech quality and intelligibility when the networks are trained with non-parallel data.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
TLDR
The proposed DNN approach can effectively suppress highly nonstationary noise, which is difficult to handle in general, and deals well with noisy speech recorded in real-world scenarios, without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
Semi-supervised Multichannel Speech Enhancement with Variational Autoencoders and Non-negative Matrix Factorization
TLDR
A Monte Carlo expectation-maximization algorithm is developed and it is experimentally shown that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF.
An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence
TLDR
A novel supervised non-negative matrix factorization (NMF) speech enhancement method based on a hidden Markov model (HMM) and the Kullback-Leibler (KL) divergence (NMF-HMM), in which a sum of Poisson distributions is used as the observation model for each HMM state.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.