Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

  • Authors: Yuxiang Kong, Jian Wu, Quandong Wang, Peng Gao, Weiji Zhuang, Yujun Wang, Lei Xie
  • Published: 18 November 2020
  • Field: Computer Science
  • Venue: 2021 IEEE Spoken Language Technology Workshop (SLT)
The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals under noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of the deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of… 
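DCUnet operates on complex STFT features, and its core building block is a complex-valued convolution, which can be realized with four real-valued convolutions. A minimal NumPy sketch of that building block (illustrative only, not the paper's implementation; the shapes, function names, and "valid" padding are assumptions):

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain real-valued 2-D 'valid' correlation (no padding, stride 1)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def complex_conv2d(x_r, x_i, w_r, w_i):
    """Complex convolution built from four real convolutions:
    (x_r + j x_i) * (w_r + j w_i)
      = (x_r*w_r - x_i*w_i) + j (x_r*w_i + x_i*w_r)
    """
    out_r = conv2d_valid(x_r, w_r) - conv2d_valid(x_i, w_i)
    out_i = conv2d_valid(x_r, w_i) + conv2d_valid(x_i, w_r)
    return out_r, out_i

# Tiny complex "spectrogram patch" and kernel for a sanity check.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
out_r, out_i = complex_conv2d(x.real, x.imag, w.real, w.imag)
```

In practice each of the four real convolutions would be a learned CNN layer; the point is only that complex arithmetic decomposes into real operations the network can represent.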


Multi-Channel Speech Enhancement with 2-D Convolutional Time-Frequency Domain Features and a Pre-Trained Acoustic Model
  • Quandong Wang, Junnan Wu, Yujun Wang
  • Computer Science
    2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2021
A fixed clean acoustic model, trained with the end-to-end lattice-free maximum mutual information criterion, is proposed to enforce the enhanced output to have the same distribution as clean waveforms, alleviating the over-estimation problem of the enhancement task and constraining distortion.


Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
New acoustic modeling techniques are developed that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input directly based on an ASR criterion, and are incorporated into the acoustic model.
Deep beamforming networks for multi-channel speech recognition
This work proposes to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network, which obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning
Two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments are proposed, each using a parameterization of the reverberant environment extracted from the observed signal to train a room-aware DNN.
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition
This paper introduces a neural network architecture which performs multichannel filtering in the first layer of the network, and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction.
Attention-Based LSTM with Multi-Task Learning for Distant Speech Recognition
This paper explores the attention mechanism embedded within the long short-term memory (LSTM) based acoustic model for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM).
Phase-aware Speech Enhancement with Deep Complex U-Net
A novel loss function, the weighted source-to-distortion ratio (wSDR) loss, is designed to correlate directly with a quantitative evaluation measure and achieves state-of-the-art performance in all metrics.
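For context, the wSDR loss referenced here has a compact closed form: an SDR-like cosine term on the speech estimate, blended with the same term on the residual noise and weighted by their relative energies. A hedged NumPy sketch (following the common formulation of this loss; the variable names and epsilon stabilizer are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def neg_cosine(a, b, eps=1e-8):
    """-<a, b> / (||a|| ||b||): bounded in [-1, 1], minimized when a, b align."""
    return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def wsdr_loss(mixture, clean, estimate, eps=1e-8):
    """Weighted SDR loss: SDR-like term on speech plus the same term on the
    noise, weighted by the relative energies of speech and noise."""
    noise = mixture - clean          # true noise component
    noise_est = mixture - estimate   # noise implied by the estimate
    e_clean = np.sum(clean ** 2)
    alpha = e_clean / (e_clean + np.sum(noise ** 2) + eps)
    return (alpha * neg_cosine(clean, estimate)
            + (1 - alpha) * neg_cosine(noise, noise_est))
```

A perfect estimate drives both cosine terms to -1, so the loss is bounded below by -1, which is part of why it behaves well as a training objective.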
Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system
This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system, in which a neural network that estimates masks for a statistically optimum beamformer is trained jointly with the acoustic model.
Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks
Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.
Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition
This paper presents a method to adapt a mask-based, statistically optimal beamforming approach to a speaker of interest, and shows that this approach improves the ASR performance of a state-of-the-art multi-channel ASR system on the CHiME-4 data.
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
A new network structure simulating complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), is proposed, in which both CNN and RNN structures can handle complex-valued operations.