Online Monaural Speech Enhancement Using Delayed Subband LSTM

@article{Li2020OnlineMS,
  title={Online Monaural Speech Enhancement Using Delayed Subband LSTM},
  author={Xiaofei Li and Radu Horaud},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.05037}
}
This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across frequencies, which drastically reduces the number of network parameters, the amount of training data and the computational burden. Training is performed… 
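The core idea — feeding each frequency's subband input (the band plus a few neighbouring bands) through one shared network, frame by frame — can be sketched as follows. This is an illustrative sketch only: the sizes, names, and the stand-in linear layer are assumptions, and the actual method uses an LSTM with a small output delay.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy STFT magnitude spectrogram: (num_frames, num_freqs).
# All sizes here are illustrative, not the paper's exact configuration.
num_frames, num_freqs, context = 100, 257, 2  # +/-2 neighbouring bands

stft_mag = np.abs(rng.standard_normal((num_frames, num_freqs)))

# One set of weights shared across ALL frequencies -- the key idea that
# drastically shrinks the parameter count. The real method uses an LSTM;
# a single linear layer stands in for it in this sketch.
w = rng.standard_normal(2 * context + 1) * 0.1

def shared_subband_net(subband_frame):
    """Stand-in for the shared per-frequency network (real method: LSTM)."""
    return float(subband_frame @ w)

# Online, frame-by-frame processing: at each frame, every frequency f is fed
# its own subband input (band f plus its neighbours) through the SAME net.
padded = np.pad(stft_mag, ((0, 0), (context, context)), mode="edge")
output = np.empty_like(stft_mag)
for t in range(num_frames):            # frames arrive one by one (online)
    for f in range(num_freqs):
        subband = padded[t, f : f + 2 * context + 1]
        output[t, f] = shared_subband_net(subband)

print(output.shape)  # (100, 257)
```

Because the same weights serve every frequency, the training data is also effectively multiplied: each frequency band contributes a training sequence to the one shared network.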

Citations

Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation

This paper presents an improved subband neural network applied to joint speech denoising and dereverberation in online single-channel scenarios, preserving the advantages of the subband model (SubNet)…

DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement

The model is extended to sub-band processing, where the bands are split and merged by learnable neural-network filters instead of engineered FIR filters, leading to a faster noise suppressor trained in an end-to-end manner; a post-processing module is adopted to further suppress unnatural residual noise.

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

An extended single-channel real-time speech enhancement framework called FullSubNet+ is proposed with several significant improvements; it reaches state-of-the-art (SOTA) performance and outperforms other existing speech enhancement approaches.

Single-Channel Speech Dereverberation using Subband Network with A Reverberation Time Shortening Target

This work proposes a subband network for single-channel speech dereverberation, along with a new learning target based on reverberation time shortening (RTS). In the time-frequency domain, we propose…

Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement

A phase-aware speech-enhancement method is proposed that estimates the magnitude and phase of a complex adaptive Wiener filter; evaluated on the open Voice Bank+DEMAND dataset, it achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 2.85 and a Short-Time Objective Intelligibility (STOI) score of 0.94, which is better than the state-of-the-art method based on cIRM estimation in the 2020 Deep Noise Suppression Challenge.

Speech Dereverberation with a Reverberation Time Shortening Target

This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or…

Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement

A lightweight full-band and sub-band fusion network is proposed, in which a dual-branch architecture models local and global spectral patterns simultaneously; it achieves performance superior to other state-of-the-art approaches with a smaller model size and lower latency.

Speech Enhancement with Fullband-Subband Cross-Attention Network

FullSubNet has shown its promising performance on speech enhancement by utilizing both fullband and subband information. However, the relationship between fullband and subband in FullSubNet is…

Quality Enhancement of Overdub Singing Voice Recordings

Singing enhancement aims to improve the perceived quality of a singing voice recording in various aspects. Focusing on the aspect of removing degradation such as background noise or room…

Fast FullSubNet: Accelerate Full-band and Sub-band Fusion Model for Single-channel Speech Enhancement

Experimental results show that, compared to FullSubNet, Fast FullSubNet has only 13% of the computational complexity and 16% of the processing time, and achieves comparable or even better performance.

References

Showing 1–10 of 29 references.

Audio-Noise Power Spectral Density Estimation Using Long Short-Term Memory

Speaker- and speech-independent experiments with different types of noise show that the proposed method outperforms the unsupervised estimators, and it generalizes well to noise types that are not present in the training set.

Complex Ratio Masking for Monaural Speech Separation

The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
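Complex ratio masking modifies both magnitude and phase by multiplying the noisy STFT with a complex-valued mask. A minimal sketch of the mask application step (the mask values below are placeholders standing in for network outputs, not the paper's actual estimates):

```python
import numpy as np

# Noisy STFT Y and a predicted complex ratio mask M (frames x freqs).
# These values are illustrative placeholders only.
Y = np.array([[1.0 + 2.0j, -0.5 + 1.0j]])   # noisy STFT
M = np.array([[0.8 - 0.1j, 0.3 + 0.4j]])    # predicted complex mask

# Complex multiplication written out in real/imaginary parts, as the mask
# is typically predicted as separate real and imaginary components:
S_real = M.real * Y.real - M.imag * Y.imag
S_imag = M.real * Y.imag + M.imag * Y.real
S = S_real + 1j * S_imag

# Identical to direct complex multiplication of mask and noisy STFT.
assert np.allclose(S, M * Y)
```

Because the mask is complex, the enhanced spectrum's phase differs from the noisy phase — the key advantage over real-valued magnitude masks.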

Convolutional Recurrent Neural Network Based Progressive Learning for Monaural Speech Enhancement

This work proposes a novel progressive learning framework with causal convolutional recurrent neural networks, called PL-CRNN, which takes advantage of both convolutional and recurrent neural networks to drastically reduce the number of parameters while simultaneously improving speech quality and speech intelligibility.

Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement

This paper investigates several aspects of training an RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech in real-time single-channel speech enhancement, and proposes two novel mean-squared-error-based learning objectives.

Exploring Monaural Features for Classification-Based Speech Segregation

This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP) features, and proposes a group Lasso approach to select complementary features in a principled way.

Single-channel speech separation with memory-enhanced recurrent neural networks

The proposed Long Short-Term Memory recurrent neural networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features, which outperforms unsupervised magnitude domain spectral subtraction by a large margin in terms of source-distortion ratio.

Multiple-target deep learning for LSTM-RNN based speech enhancement

A novel multiple-target joint learning approach is designed to fully utilize the complementarity between learning targets, and the proposed framework consistently and significantly improves objective measures of both speech quality and intelligibility.

Long short-term memory for speaker generalization in supervised speech separation.

A separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech and which substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility.

A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement

This paper incorporates a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing.

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework

A large clean-speech and noise corpus is opened for training noise suppression models, together with a test set representative of real-world scenarios consisting of both synthetic and real recordings, and an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments.