Real Time Speech Enhancement in the Waveform Domain

@inproceedings{Dfossez2020RealTS,
  title={Real Time Speech Enhancement in the Waveform Domain},
  author={Alexandre D{\'e}fossez and Gabriel Synnaeve and Yossi Adi},
  booktitle={INTERSPEECH},
  year={2020}
}
We present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation…
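To make the architecture and objective concrete, here is a minimal, hypothetical PyTorch sketch of a causal waveform encoder-decoder with skip connections, trained with an L1 waveform loss plus an STFT-magnitude loss. The channel sizes, kernel widths, loss weighting, and single STFT resolution are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalEnhancer(nn.Module):
    """Sketch of a causal waveform encoder-decoder with skip connections.

    All sizes are illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, channels=(1, 32, 64)):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        # Valid (no-lookahead) strided convolutions keep the model causal.
        self.encoder = nn.ModuleList(
            nn.Conv1d(cin, cout, kernel_size=8, stride=4) for cin, cout in pairs)
        self.decoder = nn.ModuleList(
            nn.ConvTranspose1d(cout, cin, kernel_size=8, stride=4)
            for cin, cout in reversed(pairs))

    def forward(self, x):                        # x: (batch, 1, samples)
        skips = []
        for conv in self.encoder:
            x = F.relu(conv(x))
            skips.append(x)
        for i, deconv in enumerate(self.decoder):
            skip = skips[-(i + 1)]
            x = x + skip[..., :x.shape[-1]]      # skip connection, trimmed to match
            x = deconv(x)
            if i < len(self.decoder) - 1:
                x = F.relu(x)
        return x                                 # slightly shorter than the input

def enhancement_loss(estimate, clean, fft_size=512):
    """L1 on the waveform plus L1 on STFT magnitudes (equal weights assumed)."""
    clean = clean[..., :estimate.shape[-1]]      # align lengths before comparing
    window = torch.hann_window(fft_size)
    spec = lambda s: torch.stft(s.squeeze(1), fft_size, window=window,
                                return_complex=True).abs()
    return F.l1_loss(estimate, clean) + F.l1_loss(spec(estimate), spec(clean))

model = CausalEnhancer()
noisy, clean = torch.randn(2, 1, 4096), torch.randn(2, 1, 4096)
loss = enhancement_loss(model(noisy), clean)

Using only valid convolutions means no future samples are consumed, which is what makes streaming operation possible; the skip trimming compensates for the samples each layer discards.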

Citations

High Fidelity Speech Regeneration with Application to Speech Enhancement
TLDR: A wav-to-wav generative model for speech is proposed that can generate 24 kHz speech in real time, utilizing a compact speech representation composed of ASR and identity features to achieve a higher level of intelligibility.
Speech Enhancement with Mixture of Deep Experts with Clean Clustering Pre-Training
TLDR: The architecture comprises a set of deep neural networks, each an ‘expert’ in a different speech spectral pattern such as a phoneme, which allows better robustness to unfamiliar noise types.
Speech Enhancement for Wake-Up-Word detection in Voice Assistants
TLDR: The results obtained by concatenating the SE with simple and state-of-the-art WUW detectors show that the SE does not hurt the recognition rate in quiet environments while improving performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.
An Investigation of End-to-End Models for Robust Speech Recognition
TLDR: A detailed comparison of speech-enhancement-based techniques and three different model-based adaptation techniques for robust ASR, covering data augmentation, multi-task learning, and adversarial learning, suggests that knowledge of the underlying noise type can meaningfully inform the choice of adaptation technique.
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models
TLDR: A generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses, is introduced, along with the critical observation that state-of-the-art multi-task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of losses.
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
TLDR: This study proposes MetricGAN+, which incorporates three training techniques that exploit domain knowledge of speech processing; it increases the PESQ score by 0.3 over the original MetricGAN and achieves state-of-the-art results.
CDPAM: Contrastive Learning for Perceptual Audio Similarity
TLDR: CDPAM, a metric that builds on and advances DPAM, is introduced, and adding this metric to existing speech synthesis and enhancement methods is shown to yield significant improvement, as measured by objective and subjective tests.
ICASSP 2021 Deep Noise Suppression Challenge: Decoupling Magnitude and Phase Optimization with a Two-Stage Deep Network
TLDR: This work proposes a novel denoising system for complicated applications, mainly comprising two pipelines, namely a two-stage network and a post-processing module, and demonstrates a substantial improvement in subjective quality.
BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge
TLDR: This paper describes the joint effort of BUT and Telefónica Research on developing Automatic Speech Recognition systems for the Albayzin 2020 Challenge, and assesses the effect of using a neural-based music separator named Demucs.
DPT-FSNet: Dual-path Transformer Based Full-band and Sub-band Fusion Network for Speech Enhancement
TLDR: Experimental results show that the proposed dual-path transformer-based full-band and sub-band fusion network (DPT-FSNet) for speech enhancement in the frequency domain outperforms the current state of the art in terms of PESQ, STOI, CSIG, and COVL.

References

Showing 1-10 of 47 references
A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement
TLDR: This paper incorporates a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing (a minimal sketch of this structure follows below).
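For context, a minimal, hypothetical PyTorch sketch of such a convolutional-recurrent structure is given below; the layer sizes and the fixed 161-bin spectrogram input are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class CRNSketch(nn.Module):
    """Causal CRN-style sketch: conv encoder -> LSTM bottleneck -> deconv decoder."""
    def __init__(self, hidden=256):
        super().__init__()
        # Convolutions act along frequency only, so no future frames are used.
        self.enc1 = nn.Conv2d(1, 16, (3, 1), stride=(2, 1))
        self.enc2 = nn.Conv2d(16, 32, (3, 1), stride=(2, 1))
        # 39 = frequency size after two stride-2 convs on a 161-bin input.
        self.rnn = nn.LSTM(32 * 39, hidden, batch_first=True)  # unidirectional => causal
        self.proj = nn.Linear(hidden, 32 * 39)
        self.dec1 = nn.ConvTranspose2d(32, 16, (3, 1), stride=(2, 1),
                                       output_padding=(1, 0))
        self.dec2 = nn.ConvTranspose2d(16, 1, (3, 1), stride=(2, 1))

    def forward(self, spec):                     # spec: (batch, 1, 161 bins, frames)
        x = torch.relu(self.enc2(torch.relu(self.enc1(spec))))
        b, c, f, t = x.shape
        seq, _ = self.rnn(x.permute(0, 3, 1, 2).reshape(b, t, c * f))
        x = self.proj(seq).reshape(b, t, c, f).permute(0, 2, 3, 1)
        mask = torch.sigmoid(self.dec2(torch.relu(self.dec1(x))))
        return mask * spec                       # masked magnitude spectrogram

enhanced = CRNSketch()(torch.randn(1, 1, 161, 100))   # (1, 1, 161, 100)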
Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement
TLDR: This paper investigates several aspects of training an RNN that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement, and proposes two novel mean-squared-error-based learning objectives.
A deep neural network for time-domain signal reconstruction
Yuxuan Wang, Deliang Wang. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
TLDR: A new deep network is proposed that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer, and it significantly outperforms a recent separation system based on non-negative matrix factorization in both objective speech intelligibility and quality (a sketch of training through such a layer follows below).
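To illustrate the idea of optimizing a time-domain loss through a differentiable inverse transform, here is a minimal, hypothetical PyTorch sketch; it uses a mask applied to the noisy STFT and torch.istft rather than the paper's exact formulation, and all sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
net = nn.Sequential(                    # toy frame-wise mask estimator
    nn.Linear(n_fft // 2 + 1, 512), nn.ReLU(),
    nn.Linear(512, n_fft // 2 + 1), nn.Sigmoid())

def time_domain_loss(noisy, clean):     # noisy, clean: (batch, samples)
    spec = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)
    mask = net(spec.abs().transpose(1, 2)).transpose(1, 2)
    est = torch.istft(spec * mask, n_fft, hop, window=window,
                      length=noisy.shape[-1])    # differentiable inverse STFT
    return F.mse_loss(est, clean)                # loss on the waveform itself

loss = time_domain_loss(torch.randn(2, 16000), torch.randn(2, 16000))

Because the inverse STFT is differentiable, gradients from the waveform-level loss flow back through the reconstruction into the mask estimator.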
Speech Denoising with Deep Feature Losses
TLDR: An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly, which outperforms the state of the art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework
TLDR: A large clean speech and noise corpus is opened for training noise suppression models, together with a test set representative of real-world scenarios consisting of both synthetic and real recordings, and an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments.
Improved Speech Enhancement with the Wave-U-Net
TLDR: The Wave-U-Net architecture, a model introduced by Stoller et al. for separating music vocals and accompaniment, is studied; a reduced number of hidden layers is found to be sufficient for speech enhancement compared with the original system designed for singing-voice separation.
A Wavenet for Speech Denoising
TLDR: The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities while significantly reducing its time complexity by eliminating its autoregressive nature.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
Yi Luo, N. Mesgarani. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
TLDR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures (a sketch of this mask-based time-domain approach follows below).
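A minimal, hypothetical PyTorch sketch of the mask-based time-domain idea follows; the separator here is a toy convolutional stack, not the paper's temporal convolutional network, and all sizes are illustrative.

import torch
import torch.nn as nn

class MaskSepSketch(nn.Module):
    """Learned filterbank encoder, per-source masks, learned decoder."""
    def __init__(self, n_filters=256, kernel=16, stride=8, n_sources=2):
        super().__init__()
        self.n_sources = n_sources
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)  # replaces the STFT
        self.separator = nn.Sequential(                                # toy mask estimator
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_filters, n_filters * n_sources, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mix):                      # mix: (batch, 1, samples)
        rep = torch.relu(self.encoder(mix))      # (batch, filters, frames)
        masks = self.separator(rep).view(
            mix.shape[0], self.n_sources, -1, rep.shape[-1])
        # Mask the shared representation once per source, then decode each.
        return torch.stack(
            [self.decoder(rep * m) for m in masks.unbind(dim=1)], dim=1)

sources = MaskSepSketch()(torch.randn(1, 1, 8000))    # (1, 2, 1, samples)

The key design choice is replacing the fixed STFT with a learned analysis/synthesis filterbank, so masking happens in a representation optimized for separation.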
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
TLDR: The proposed DNN approach suppresses even highly nonstationary noise, which is generally difficult to handle, and is effective on noisy speech recorded in real-world scenarios, without producing the annoying musical artifacts common in conventional enhancement methods.
SEGAN: Speech Enhancement Generative Adversarial Network
TLDR: This work proposes generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; 28 speakers and 40 different noise conditions are incorporated into the same model, so that model parameters are shared across them.