Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet

@article{Xu2022DeepNS,
  title={Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet},
  author={Ziyi Xu and Maximilian Strake and Tim Fingscheidt},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2022},
  volume={30},
  pages={1572-1585}
}
Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). DNS trained with mean squared error (MSE) losses cannot guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable and therefore cannot be used directly as an optimization criterion for gradient-based learning. In this work, we propose an end-to-end…
Does a PESQNet (Loss) Require a Clean Reference Input? The Original PESQ Does, But ACR Listening Tests Don't
TLDR
It is concluded that an intrusive PESQNet is unnecessary for DNS training: the still powerful non-intrusive PESQNet achieves comparable performance, which opens the possibility of using real training data.
MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data
TLDR
This work proposes MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system), which introduces an additional network, a “de-generator”, to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training.

References

Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data
TLDR
This work proposes an end-to-end non-intrusive PESQNet DNN which estimates perceptual evaluation of speech quality (PESQ) scores, allowing a reference-free loss for real data.
Separated Noise Suppression and Speech Restoration: LSTM-Based Speech Enhancement in Two Stages
TLDR
This work addresses the problem that speech distortions can be introduced when employing NNs trained to provide strong noise suppression. It proposes first suppressing noise and subsequently restoring speech, with specifically chosen NN topologies for each of these distinct tasks.
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement
TLDR
The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation.
Learning With Learned Loss Function: Speech Enhancement With Quality-Net to Improve Perceptual Evaluation of Speech Quality
TLDR
This study proposes optimizing the enhancement model with an approximated PESQ function, which is differentiable and learned from the training data, and shows that the learned surrogate function can guide the enhancement model to further boost the PESQ score and maintain the speech intelligibility.
A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions
TLDR
A new perceptually-weighted objective function is proposed within a feedforward DNN framework, aiming to minimize the perceptual difference between the enhanced speech and the target speech.
Monaural Speech Enhancement Using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure
TLDR
It is shown that the proposed SE system, when trained using an approximate-STOI cost function, performs on par with a system trained with a mean square error cost applied to short-time temporal envelopes, suggesting that traditional DNN-based STSA SE systems might be optimal in terms of estimated speech intelligibility.
Training Supervised Speech Separation System to Improve STOI and PESQ Directly
TLDR
Experimental results show the speech separation performance can be improved by the proposed method, and the calculated gradients are used in the gradient descent algorithm to optimize the STOI and PESQ directly.
A deep neural network for time-domain signal reconstruction
  • Yuxuan Wang, Deliang Wang
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
A new deep network is proposed that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer and significantly outperforms a recent non-negative matrix factorization based separation system in both objective speech intelligibility and quality.
End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks
TLDR
Experimental results show that the STOI of a test speech processed by the proposed end-to-end utterance-based speech enhancement framework using fully convolutional neural networks is better than conventional MSE-optimized speech due to the consistency between the training and the evaluation targets.
A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality
TLDR
Two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics.