Corpus ID: 219177327

Phase-aware Single-stage Speech Denoising and Dereverberation with U-Net

Authors: Hyeong-Seok Choi, Hoon Heo, Jie Hwan Lee, Kyogu Lee
In this work, we tackle denoising and dereverberation with a single-stage framework. Although denoising and dereverberation may be considered two separate challenging tasks, each typically requiring its own module, we show that a single deep network can be shared to solve both. To this end, we propose a new masking method called the phase-aware beta-sigmoid mask (PHM), which reuses the estimated magnitude values to estimate the clean phase by respecting the…
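The geometric idea behind a phase-aware mask can be illustrated with a small NumPy sketch. This is an illustration only, not the authors' code: it assumes the mixture, speech, and residual spectrogram bins form a triangle in the complex plane, so the phase difference follows from the law of cosines; `phase_from_magnitudes` and the toy signals are hypothetical names.

```python
import numpy as np

def phase_from_magnitudes(mix_mag, speech_mag, rest_mag):
    """Recover cos of the phase difference between mixture and clean speech
    from magnitudes alone, via the law of cosines: since Y = S + N, the
    three magnitudes form a triangle in the complex plane."""
    cos_dtheta = (mix_mag**2 + speech_mag**2 - rest_mag**2) / (
        2.0 * mix_mag * speech_mag + 1e-12)
    # Numerical safety: estimates that violate the triangle inequality
    # would push the cosine outside [-1, 1].
    return np.clip(cos_dtheta, -1.0, 1.0)

# Toy example: one spectrogram bin of a mixture Y = S + N.
S = 0.8 * np.exp(1j * 0.3)   # clean-speech bin
N = 0.5 * np.exp(1j * 2.0)   # noise/residual bin
Y = S + N
cos_d = phase_from_magnitudes(np.abs(Y), np.abs(S), np.abs(N))
# The recovered cosine matches the true mixture/speech phase difference.
assert np.isclose(cos_d, np.cos(np.angle(Y) - np.angle(S)))
```

This is why a mask whose magnitude outputs respect the triangle inequality makes the clean phase recoverable up to a sign, with no separate phase-estimation stage; the exact beta-sigmoid parametrization of the mask is not reproduced here.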


Training Speech Enhancement Systems with Noisy Speech Datasets
This paper proposes several modifications of the loss functions that make them robust against noisy speech targets, and a noise augmentation scheme for mixture-invariant training (MixIT) that allows it to be used in such scenarios as well.
Predicting score distribution to improve non-intrusive speech quality estimation
Several ways of integrating the distribution of opinion scores (e.g., variance and histogram information) into MOS estimation are investigated, yielding up to a 0.016 RMSE and 1% SRCC improvement.
HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features
Objective and subjective evaluations show that the proposed HiFi-GAN-2 outperforms state-of-the-art baselines on both conventional denoising and joint dereverberation-and-denoising tasks.
Transformers with Competitive Ensembles of Independent Mechanisms
This work proposes Transformers with Independent Mechanisms (TIM), a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms which exchange information only through attention, together with a competition mechanism that encourages the mechanisms to specialize over time steps and thus become more independent.
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
Both objective and subjective evaluations show that VoiceFixer, which includes a synthesis stage that generates waveforms using a neural vocoder, is effective on severely degraded speech such as real-world historical recordings.
Deep learning in electron microscopy
This review paper offers a practical perspective aimed at developers with limited familiarity with deep learning, discussing the hardware and software needed to get started with deep learning and to interface with electron microscopes.
ICASSP 2021 Deep Noise Suppression Challenge
A DNS challenge special session at INTERSPEECH 2020 was organized, where open-sourced training and test datasets were released and a subjective evaluation framework was used to evaluate and select the final winners.
Interactive Speech and Noise Modeling for Speech Enhancement
This paper proposes a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net, and designs a residual-convolution-and-attention (RA) feature extraction module to capture correlations along the temporal and frequency dimensions for both speech and noise.
Interspeech 2021 Deep Noise Suppression Challenge
In this version of the Deep Noise Suppression challenge, the training and test datasets were expanded to accommodate fullband scenarios and more challenging test conditions, and DNSMOS, a reliable non-intrusive objective speech quality metric for wideband, was made available for participants to use during their development phase.
Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation
This work introduces a self-attentive network with a novel sandglass shape, namely Sandglasset, which advances state-of-the-art (SOTA) speech separation performance at significantly smaller model size and computational cost.


Phase-aware Speech Enhancement with Deep Complex U-Net
A novel loss function, the weighted source-to-distortion ratio (wSDR) loss, is designed to correlate directly with a quantitative evaluation measure, and the model achieves state-of-the-art performance on all metrics.
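The wSDR loss from that paper can be sketched in a few lines of NumPy: a convex combination of bounded SDR terms for the target speech and the residual noise, weighted by their relative energies. This is a minimal sketch assuming that formulation; `wsdr_loss` and the toy signals are illustrative, not the authors' code.

```python
import numpy as np

def neg_cosine(a, b, eps=1e-12):
    # Negative cosine similarity: a bounded ([-1, 1]) stand-in for SDR.
    return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def wsdr_loss(mixture, clean, estimate):
    """Weighted-SDR loss: combine SDR terms for the speech estimate and
    for the implied noise estimate, weighted by their energy ratio."""
    noise = mixture - clean          # ground-truth noise
    noise_est = mixture - estimate   # noise implied by the speech estimate
    alpha = np.sum(clean**2) / (np.sum(clean**2) + np.sum(noise**2))
    return (alpha * neg_cosine(clean, estimate)
            + (1.0 - alpha) * neg_cosine(noise, noise_est))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
mixture = clean + 0.3 * rng.standard_normal(16000)
# A perfect estimate drives both cosine terms to -1, the loss minimum.
assert np.isclose(wsdr_loss(mixture, clean, clean), -1.0)
```

Including the noise term penalizes estimates that merely attenuate the mixture, since their implied noise estimate then correlates poorly with the true noise.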
Two-Stage Deep Learning for Noisy-Reverberant Speech Enhancement
This work proposes a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks, and designs a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes.
PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network
This paper proposes a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, which has the ability to handle detailed phase patterns and to utilize harmonic patterns, and outperforms previous methods by a large margin on four metrics.
Channel-Attention Dense U-Net for Multichannel Speech Enhancement
This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Enhanced Time-Frequency Masking by Using Neural Networks for Monaural Source Separation in Reverberant Room Environments
The proposed enhanced time-frequency (T-F) mask improves separation performance and outperforms state-of-the-art methods, particularly in highly reverberant and noisy room environments.
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are modeled as layers of the network so that the whole system can be trained end-to-end.
Multi-Scale multi-band densenets for audio source separation
A novel network architecture that extends the recently developed densely connected convolutional network (DenseNet) takes advantage of long contextual information and outperforms state-of-the-art results on the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio.
PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Audio Source Separation
Experimental results show that the classification-based approach successfully recovers the phase of the target source in the discretized domain, improves signal-to-distortion ratio (SDR) over the regression-based approach in both the speech enhancement task and the music source separation (MSS) task, and outperforms state-of-the-art MSS methods.
SDR – Half-baked or Well Done?
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
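The scale-invariant SDR (SI-SDR) advocated in that paper can be sketched as follows: the estimate is projected onto the reference before computing the energy ratio, so rescaling the estimate cannot inflate the score. A minimal NumPy sketch under that definition; `si_sdr` and the toy signals are illustrative names.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the energies of the projection and the residual error."""
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference        # scaled reference component
    error = estimate - target         # everything orthogonal to it
    return 10.0 * np.log10(np.sum(target**2) / (np.sum(error**2) + eps))

rng = np.random.default_rng(1)
ref = rng.standard_normal(8000)
est = ref + 0.1 * rng.standard_normal(8000)
# Rescaling the estimate leaves SI-SDR unchanged, unlike plain SDR.
assert np.isclose(si_sdr(ref, est), si_sdr(ref, 5.0 * est))
```

This invariance is the point of the critique: an evaluation metric that a trivial gain change can manipulate gives misleading single-channel separation results.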
Masking Estimation with Phase Restoration of Clean Speech for Monaural Speech Enhancement
Two T-F masks are presented to simultaneously enhance the magnitude and phase of the speech spectrum, based on the assumption that its real and imaginary parts are uncorrelated, and are used as the training targets of a DNN model.