Corpus ID: 245124354

Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR

Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier
Single-channel speech enhancement approaches do not always improve automatic recognition rates in the presence of noise, because they can introduce distortions that are unhelpful for recognition. Following the trend toward end-to-end training of sequential neural network models, several research groups have addressed this problem by jointly training a front-end enhancement module with a back-end recognition module. While this approach ensures that enhancement outputs are helpful for recognition, the… 
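The core idea named in the title — a perceptual loss computed through a recognition model — can be illustrated with a minimal numpy sketch. Here a fixed random projection stands in for the hidden layer of a frozen pre-trained recognizer (the weights, shapes, and function names are hypothetical, not from the paper); the enhancement model is trained to match the recognizer's activations on clean speech rather than the clean spectrogram itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "recognition model": a fixed random projection standing in for
# a hidden layer of a pre-trained acoustic model (hypothetical stand-in).
W_rec = rng.standard_normal((40, 64))

def recognizer_features(frames):
    """Hidden activations of the frozen recognizer (one ReLU layer)."""
    return np.maximum(frames @ W_rec, 0.0)

def perceptual_loss(enhanced, clean):
    """Mean squared distance between recognizer activations of the
    enhanced output and the clean reference."""
    diff = recognizer_features(enhanced) - recognizer_features(clean)
    return float(np.mean(diff ** 2))

clean = rng.standard_normal((100, 40))               # 100 frames, 40 mel bins
noisy = clean + 0.5 * rng.standard_normal((100, 40))  # simulated noisy input
```

Because the recognizer is frozen, only the enhancement model receives gradients, so its outputs are pushed toward whatever the recognizer finds perceptually relevant; a perfect enhancer drives this loss to zero.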


Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition
An in-depth evaluation of speech separation techniques as a front-end for noise-robust automatic speech recognition (ASR) is performed, along with a diagonal feature discriminant linear regression (dFDLR) adaptation that can be applied on a per-utterance basis to ASR systems employing deep neural networks and HMMs.
A Joint Training Framework for Robust Automatic Speech Recognition
A novel joint training framework for speech separation and recognition that concatenates a deep neural network (DNN) based speech separation front-end and a DNN-based acoustic model into a larger neural network, jointly adjusting the weights in each module.
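The joint training scheme summarized above can be sketched with a toy numpy example: a linear "enhancement" front-end and a linear "acoustic model" back-end are concatenated, and one backward pass updates the weights of both modules (all sizes, names, and the MSE objective here are illustrative assumptions, not the papers' actual setup).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint network: linear front-end followed by linear back-end.
W_enh = 0.1 * rng.standard_normal((40, 40))
W_am = 0.1 * rng.standard_normal((40, 10))

x = rng.standard_normal((8, 40))        # noisy feature frames
target = rng.standard_normal((8, 10))   # back-end training targets

def forward(x):
    h = x @ W_enh        # enhanced features from the front-end
    return h, h @ W_am   # scores from the back-end acoustic model

def mse(y):
    return float(np.mean((y - target) ** 2))

loss_before = mse(forward(x)[1])

lr = 0.01
for _ in range(300):
    h, y = forward(x)
    g_y = 2.0 * (y - target) / y.size   # dL/dy for the MSE loss
    g_am = h.T @ g_y                    # back-end weight gradient
    g_h = g_y @ W_am.T                  # gradient flowing into the front-end
    g_enh = x.T @ g_h                   # front-end weight gradient
    W_am -= lr * g_am                   # weights in both modules
    W_enh -= lr * g_enh                 # are adjusted jointly

loss_after = mse(forward(x)[1])
```

The key point is the chain rule step `g_h = g_y @ W_am.T`: the recognition loss flows through the back-end into the front-end, so the enhancement module is optimized for recognition rather than for signal fidelity alone.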
Joint training of front-end and back-end deep neural networks for robust speech recognition
It is shown that the word error rate (WER) of the jointly trained system can be significantly reduced by fusing multiple DNN pre-processing systems, implying that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary.
Bridging the Gap Between Monaural Speech Enhancement and Recognition with Distortion-Independent Acoustic Modeling
Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem, and the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models
A generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses, is introduced, along with the critical observation that state-of-the-art multi-task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.
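An ensemble perceptual loss of the kind PERL describes can be sketched as a hand-weighted sum of per-model feature distances. The three fixed random projections below stand in for distinct frozen pre-trained networks, and the weights are the hand-tuned coefficients the observation above refers to (all names and shapes are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(2)

# Three frozen "audio pattern recognition" models, each a fixed random
# projection standing in for a different pre-trained network.
models = [rng.standard_normal((40, d)) for d in (32, 64, 16)]

def feat(W, x):
    """One frozen model's ReLU feature map."""
    return np.maximum(x @ W, 0.0)

def ensemble_loss(enhanced, clean, weights):
    """Hand-weighted sum of per-model perceptual losses."""
    return float(sum(
        w * np.mean((feat(W, enhanced) - feat(W, clean)) ** 2)
        for W, w in zip(models, weights)))

clean = rng.standard_normal((50, 40))
noisy = clean + 0.3 * rng.standard_normal((50, 40))
```

Since each model attends to different aspects of the signal, the weights control how much each frozen network's notion of perceptual similarity regularizes the enhancement model.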
Speech Denoising with Deep Feature Losses
An end-to-end deep learning approach to denoising speech signals that processes the raw waveform directly, outperforming the state of the art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
Wide Residual BLSTM Network with Discriminative Speaker Adaptation for Robust Speech Recognition
A system for the 4th CHiME challenge that significantly improves performance on all three tracks over the provided baseline and is independent of the microphone configuration, i.e., a configuration that does not combine multiple systems.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Two different approaches to speech enhancement for training TTS systems are investigated, the first following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Spectral Feature Mapping with MIMIC Loss for Robust Speech Recognition
A global criterion is proposed to ensure that de-noised speech remains useful for downstream tasks like ASR, showing significant improvements in WER.