Online Monaural Speech Enhancement Using Delayed Subband LSTM

Xiaofei Li and Radu Horaud
This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across frequencies, which drastically reduces the number of network parameters, the amount of training data and the computational burden. Training is performed… 
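The key idea of the abstract — one shared network applied identically at every frequency, fed a small subband of neighboring STFT bins — can be sketched in numpy. The LSTM itself is omitted here (a placeholder stands in for it), and the function name and neighbor count are illustrative, not the paper's exact configuration:

```python
import numpy as np

def subband_inputs(stft_mag, n_neighbors=2):
    """Stack each frequency bin with its neighbors to form per-frequency
    subband sequences. stft_mag: (F, T) magnitude spectrogram.
    Returns an array of shape (F, T, 2*n_neighbors + 1)."""
    F, T = stft_mag.shape
    padded = np.pad(stft_mag, ((n_neighbors, n_neighbors), (0, 0)), mode="reflect")
    return np.stack([padded[f:f + 2 * n_neighbors + 1, :].T for f in range(F)])

# One shared model would process each of the F sequences identically,
# so the parameter count is independent of the number of frequencies F.
mag = np.abs(np.random.randn(257, 100))  # e.g. 512-point STFT, 100 frames
X = subband_inputs(mag, n_neighbors=2)
print(X.shape)  # (257, 100, 5)
```

Because every frequency is handled by the same network, training data from all frequencies is pooled, which is one reason the method needs comparatively little data.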


Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation

This paper presents an improved subband neural network applied to joint speech denoising and dereverberation in online single-channel scenarios, preserving the advantages of the subband model (SubNet)…

DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement

The new model, named DCCRN+, surpasses the original DCCRN as well as several competitive models in terms of PESQ and DNSMOS, and achieves superior performance in the new Interspeech 2021 DNS Challenge.

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

An extended single-channel real-time speech enhancement framework called FullSubNet+ is proposed with several significant improvements; it reaches state-of-the-art (SOTA) performance and outperforms other existing speech enhancement approaches.

Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement

A phase-aware speech enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter; evaluated on the open Voice Bank+DEMAND dataset, it achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 2.85 and a Short-Time Objective Intelligibility (STOI) score of 0.94, better than the state-of-the-art method based on cIRM estimation in the 2020 Deep Noise Suppression Challenge.

Speech Dereverberation with a Reverberation Time Shortening Target

This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or…

Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement

A lightweight full-band and sub-band fusion network, where a dual-branch architecture is employed to model local and global spectral patterns simultaneously; it achieves performance superior to other state-of-the-art approaches with smaller model size and lower latency.

Speech Enhancement with Fullband-Subband Cross-Attention Network

FullSubNet has shown its promising performance on speech enhancement by utilizing both fullband and subband information. However, the relationship between fullband and subband in FullSubNet is…

Quality Enhancement of Overdub Singing Voice Recordings

Singing enhancement aims to improve the perceived quality of a singing voice recording in various aspects. Focusing on the aspect of removing degradation such as background noise or room…

Low-complexity artificial noise suppression methods for deep learning-based speech enhancement algorithms

The paper proposes three strategies to estimate the noise PSD frame by frame; the residual noise can then be removed effectively by applying a gain function based on the decision-directed approach. Objective measurements show that the proposed postfiltering strategies outperform the conventional postfilter in terms of both segmental signal-to-noise ratio (SNR) and speech quality.
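The decision-directed approach mentioned above is the classic Ephraim-Malah recursion for the a priori SNR, combined with a suppression gain. A simplified sketch (the noise-PSD estimate is assumed given, and the Wiener-type gain with a floor is one common choice, not necessarily the paper's):

```python
import numpy as np

def dd_gain(noisy_pow, noise_psd, alpha=0.98, gmin=0.1):
    """Frame-by-frame decision-directed a priori SNR estimation with a
    floored Wiener-type gain. noisy_pow, noise_psd: (F, T) power arrays."""
    F, T = noisy_pow.shape
    gains = np.empty((F, T))
    prev_clean_pow = np.zeros(F)  # |clean estimate|^2 of the previous frame
    for t in range(T):
        npsd = np.maximum(noise_psd[:, t], 1e-12)
        post_snr = noisy_pow[:, t] / npsd
        # a priori SNR: smooth previous clean estimate with current ML estimate
        xi = alpha * prev_clean_pow / npsd \
             + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)
        g = np.maximum(xi / (1.0 + xi), gmin)  # Wiener gain, floored at gmin
        gains[:, t] = g
        prev_clean_pow = (g ** 2) * noisy_pow[:, t]
    return gains

noise = np.full((129, 50), 0.5)
noisy = noise + np.abs(np.random.randn(129, 50))
G = dd_gain(noisy, noise)
```

The gain floor `gmin` is what distinguishes a postfilter that leaves comfort noise from one that suppresses aggressively.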

Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement

Experimental results show that full-band and sub-band information are complementary, that FullSubNet can effectively integrate them, and that its performance exceeds that of the top-ranked methods in the DNS Challenge (INTERSPEECH 2020).



Multichannel Speech Enhancement Based On Time-Frequency Masking Using Subband Long Short-Term Memory

  • Xiaofei Li, R. Horaud
  • Engineering
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
The proposed multichannel speech enhancement method, based on a long short-term memory (LSTM) recurrent neural network, outperforms a baseline deep-learning-based full-band method and an unsupervised method, and generalizes very well to unseen speakers and noise types.

Audio-Noise Power Spectral Density Estimation Using Long Short-Term Memory

Speaker- and speech-independent experiments with different types of noise show that the proposed method outperforms the unsupervised estimators, and it generalizes well to noise types that are not present in the training set.

Complex Ratio Masking for Monaural Speech Separation

The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
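The complex ratio mask referenced in the title is the bin-wise complex factor that maps the noisy STFT to the clean STFT, recovering phase as well as magnitude. A minimal numpy sketch (the ideal mask computed from known clean/noisy spectra; in practice a network estimates it):

```python
import numpy as np

def cirm(S, Y, eps=1e-12):
    """Complex ideal ratio mask: bin-wise complex factor with M * Y ~= S.
    S, Y: clean and noisy complex STFTs of equal shape."""
    return (S * np.conj(Y)) / (np.abs(Y) ** 2 + eps)

# applying the mask reconstructs both magnitude and phase of the clean speech
S = np.random.randn(4, 8) + 1j * np.random.randn(4, 8)
N = 0.3 * (np.random.randn(4, 8) + 1j * np.random.randn(4, 8))
Y = S + N
M = cirm(S, Y)
S_hat = M * Y
```

Unlike a real-valued magnitude mask, the complex mask corrects the noisy phase, which is why cIRM-based methods score higher on perceptual metrics such as PESQ.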

Convolutional Recurrent Neural Network Based Progressive Learning for Monaural Speech Enhancement

This work proposes a novel progressive learning framework with causal convolutional recurrent neural networks called PL-CRNN, which takes advantage of both convolutional neural networks and recurrent neural networks to drastically reduce the number of parameters and simultaneously improve speech quality and speech intelligibility.

Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement

This paper investigates several aspects of training a RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement and proposes two novel mean-squared-error-based learning objectives.

Exploring Monaural Features for Classification-Based Speech Segregation

This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP), and proposes to use a group Lasso approach to select complementary features in a principled way.

Single-channel speech separation with memory-enhanced recurrent neural networks

The proposed long short-term memory (LSTM) recurrent neural networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude-domain soft mask is constructed from these features; the approach outperforms unsupervised magnitude-domain spectral subtraction by a large margin in terms of source-to-distortion ratio.
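The magnitude-domain soft mask described above is typically the ratio of the predicted speech magnitude to the sum of predicted speech and noise magnitudes. A hedged numpy sketch (function name and inputs are illustrative; here the "predictions" are random placeholders rather than network outputs):

```python
import numpy as np

def soft_mask(speech_mag, noise_mag, eps=1e-12):
    """Magnitude-domain soft mask from predicted speech and noise
    magnitudes; each time-frequency value lies in [0, 1]."""
    return speech_mag / (speech_mag + noise_mag + eps)

speech_est = np.abs(np.random.randn(129, 40))  # stand-in for LSTM speech output
noise_est = np.abs(np.random.randn(129, 40))   # stand-in for LSTM noise output
mask = soft_mask(speech_est, noise_est)
noisy_mag = speech_est + noise_est             # synthetic noisy magnitude
enhanced = mask * noisy_mag                    # masked (enhanced) magnitude
```

The enhanced magnitude is then combined with the noisy phase for inverse-STFT resynthesis, which is the standard pipeline for magnitude-only masking.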

Multiple-target deep learning for LSTM-RNN based speech enhancement

A novel multiple-target joint learning approach is designed to fully utilize the complementarity of the targets, and the proposed framework consistently and significantly improves objective measures of both speech quality and intelligibility.

Long short-term memory for speaker generalization in supervised speech separation.

A separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech and which substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility.

A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement

This paper incorporates a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing.