Universal Speech Enhancement with Score-based Diffusion

  Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oğuz Araz, Davide Scaini
Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech… 


Blurring Diffusion Models


Speech Enhancement and Dereverberation with Diffusion-based Generative Models

This work builds upon previous work, derives the training task within the formalism of stochastic differential equations, and achieves remarkable state-of-the-art performance in single-channel speech dereverberation.



Restoring degraded speech via a modified diffusion model

A neural network architecture, based on a modification of the DiffWave model, that restores the original speech signal and achieves better performance on several objective perceptual metrics and in subjective comparisons.

DiffWave: A Versatile Diffusion Model for Audio Synthesis

DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

WaveGrad: Estimating Gradients for Waveform Generation

WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.
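The speed-for-quality trade-off described above comes from the number of iterative refinement steps the sampler runs. The sketch below is a toy illustration of that principle only, not WaveGrad's actual sampler: the `refine` function, `step_size`, and the synthetic signals are all illustrative assumptions, standing in for the learned gradient-based denoising updates.

```python
import numpy as np

def refine(x, target, n_steps, step_size=0.3):
    """Toy iterative refinement: each step nudges the noisy estimate
    toward the clean target, mimicking how a diffusion sampler
    improves its output over successive denoising iterations."""
    for _ in range(n_steps):
        x = x + step_size * (target - x)
    return x

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 2 * np.pi, 64))   # stand-in "clean" waveform
noisy = target + rng.normal(0.0, 1.0, 64)         # stand-in noisy input

# Fewer steps: fast but coarse. More steps: slow but closer to the target.
for n in (2, 10, 50):
    err = np.abs(refine(noisy.copy(), target, n) - target).mean()
    print(f"{n:>2} steps -> mean abs error {err:.4f}")
```

Running this shows the mean error shrinking monotonically as the step count grows, which is the knob WaveGrad exposes: choose a small step count for fast inference or a large one for higher sample quality.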

DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments (1.0)

  • In Proc. of the Int. Congress on Acoustics (ICA),
  • 2013

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Learning Sound Event Classifiers from Web Audio with Noisy Labels

Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.

Speech enhancement

  • Jae S. Lim
  • Computer Science
    ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1986
Provides an overview of various techniques proposed for speech enhancement and suggests directions for future research on the problem.

A short-time objective intelligibility measure for time-frequency weighted noisy speech

An objective intelligibility measure is presented, which shows high correlation (rho=0.95) with the intelligibility of both noisy and TF-weighted noisy speech, and significantly outperforms three other, more sophisticated objective measures.

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

This work proposes a novel training task for speech enhancement using a complex-valued deep neural network, and derives this training task within the formalism of stochastic differential equations (SDEs), thereby enabling the use of predictor-corrector samplers.