Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim
For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end without parallel data within an end-to-end automatic speech recognition (ASR) scheme. However, the ASR…
1 Citation

Robust Speech Representation Learning via Flow-based Embedding Regularization
  • Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan
  • Engineering, Computer Science
  • 2021
Over the recent years, various deep learning-based methods were proposed for extracting a fixed-dimensional embedding vector from speech signals. Although the deep learning-based embedding extraction…


Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR
Evaluation on two databases demonstrates improved performance for on-line processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications.
Learning Spectral Mapping for Speech Dereverberation and Denoising
Deep neural networks are trained to directly learn a spectral mapping from the magnitude spectrogram of corrupted speech to that of clean speech, which substantially attenuates the distortion caused by reverberation, as well as background noise, and is conceptually simple.
Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks
It is shown that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research
The REVERB challenge is described, which is an evaluation campaign that was designed to evaluate such speech enhancement and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions.
Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition
A recurrent neural network with long short-term memory (LSTM) architecture is proposed to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses.
Joint CTC-attention based end-to-end speech recognition using multi-task learning
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
ESPnet: End-to-End Speech Processing Toolkit
The major architecture of this software platform, several important functionalities that differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks are explained.
WaveGlow: A Flow-based Generative Network for Speech Synthesis
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
Speech recognition in noisy environments: A survey
  • Y. Gong
  • Computer Science
  • Speech Commun.
  • 1995
The survey indicates that the essential points in noisy speech recognition consist of incorporating time and frequency correlations, giving more importance to high SNR portions of speech in decision making, exploiting task-specific a priori knowledge both of speech and of noise, using class-dependent processing, and including auditory models in speech processing.
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
The first text-to-wave neural architecture for speech synthesis is introduced; it is fully convolutional, enables fast end-to-end training from scratch, and significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet.