• Corpus ID: 238253029

Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux
A promising approach for multi-microphone speech separation involves two deep neural networks (DNNs), where the predicted target speech from the first DNN is used to compute signal statistics for time-invariant minimum variance distortionless response (MVDR) beamforming, and the MVDR result is then used as extra features for the second DNN to predict target speech. Previous studies suggested that the MVDR result can provide complementary information for the second DNN to better predict target… 
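
As a rough illustration of the pipeline described in the abstract, the block below writes out one common textbook form of a time-invariant MVDR beamformer driven by a DNN estimate. This is a sketch under assumptions, not necessarily the exact variant used in the paper: the symbols $\mathbf{Y}(t,f)$ (multi-microphone mixture STFT), $\hat{\mathbf{S}}(t,f)$ (first-DNN target estimate at all microphones), $\mathbf{u}$ (one-hot vector selecting the reference microphone), and $T$ (number of frames) are introduced here for illustration only.

```latex
% Signal statistics computed from the first DNN's estimate (assumed form)
\hat{\boldsymbol{\Phi}}_{s}(f) = \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{S}}(t,f)\, \hat{\mathbf{S}}(t,f)^{\mathsf{H}},
\qquad
\hat{\boldsymbol{\Phi}}_{n}(f) = \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{N}}(t,f)\, \hat{\mathbf{N}}(t,f)^{\mathsf{H}},
\quad
\hat{\mathbf{N}}(t,f) = \mathbf{Y}(t,f) - \hat{\mathbf{S}}(t,f)

% Time-invariant MVDR weights (one weight vector per frequency, shared by all frames)
\mathbf{w}(f) = \frac{\hat{\boldsymbol{\Phi}}_{n}(f)^{-1}\, \hat{\boldsymbol{\Phi}}_{s}(f)}
                     {\operatorname{tr}\!\left( \hat{\boldsymbol{\Phi}}_{n}(f)^{-1}\, \hat{\boldsymbol{\Phi}}_{s}(f) \right)}\, \mathbf{u}

% Beamformed output, usable as an extra input feature for the second DNN
\hat{S}_{\mathrm{BF}}(t,f) = \mathbf{w}(f)^{\mathsf{H}}\, \mathbf{Y}(t,f)
```

Because $\mathbf{w}(f)$ does not depend on $t$, the same linear filter is applied to every frame.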

Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-Se Submission to the L3DAS22 Challenge

The proposed method, which combines deep neural network-driven complex spectral mapping with linear beamformers such as the multi-frame multi-channel Wiener filter, was ranked first in the challenge, achieving a ranking metric of 0.984, versus 0.833 for the challenge baseline.

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Experimental results show that the fully E2E ASR model can achieve competitive performance in both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems.
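
For readers unfamiliar with the term, a "relative" WER reduction is measured as a fraction of the baseline WER rather than in absolute percentage points. The short sketch below shows the arithmetic; the function name `relative_wer_reduction` and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline at 20.0% WER and a system at 13.0% WER.
reduction = relative_wer_reduction(20.0, 13.0)
print(f"{reduction:.0%}")  # prints "35%"
```

In this made-up case the absolute reduction is 7 percentage points, while the relative reduction is 35%.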

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

TF-GridNet is extended to multi-microphone conditions through multi-microphone complex spectral mapping, and integrated into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed based on the outputs of the first DNN.

STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency

This work employs complex spectral mapping for frame-online enhancement, where a deep neural network is trained to predict the real and imaginary components of target speech from the mixture RI components, and proposes a future-frame prediction technique to reduce the algorithmic latency.

Improving Frame-Online Neural Speech Enhancement With Overlapped-Frame Prediction

This work proposes an overlapped-frame prediction technique for deep learning based frame-online speech enhancement, where at each frame the authors' deep neural network predicts the current and several past frames that are necessary for overlap-add, instead of only predicting the current frame.

TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation

We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario.



Deep Learning Based Target Cancellation for Speech Dereverberation

These models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.

Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

A novel method of time-varying beamforming with estimated complex spectra for single- and multi-channel speech enhancement, where deep neural networks are used to predict the real and imaginary components of the direct-path signal from noisy and reverberant ones.

Real-Time Binaural Speech Separation with Preserved Spatial Cues

  • Cong Han, Yi Luo, N. Mesgarani
  • Physics
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A multi-input-multi-output (MIMO) end-to-end extension of TasNet that takes binaural mixed audio as input and simultaneously separates target speakers in both channels is proposed, enabling a real-time modification of the acoustic scene.

Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation

The key idea is to first estimate the direct-path signal of the target speaker using a DNN and then identify signals that are decayed and delayed copies of the estimated direct-path signal, as these can be reliably considered as reverberation.

Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.

Jointly Optimal Denoising, Dereverberation, and Source Separation

Methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way are proposed.

A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR

The core of the algorithm estimates a time-frequency mask that represents the target speech and uses mask-based beamforming to enhance corrupted speech; a mask-based post-filter is proposed to further suppress the noise in the beamformer output.

Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement

Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly improve on x-vector's performance, and the covariance matrix estimate is effective for the MVDR beamformer.

Blind and Neural Network-Guided Convolutional Beamformer for Joint Denoising, Dereverberation, and Source Separation

A blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics is developed and greatly outperforms the conventional state-of-the-art NN-supported mask-based CBF in terms of the improvement in automatic speech recognition and signal distortion reduction performance.

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

A fully-convolutional neural network structure is used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction; the WER can be further reduced by 29% relative using an acoustic model trained on clean and reverberated data.