Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

  • Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu
  • Published 30 October 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end (E2E) neural network manner, called directional automatic speech recognition (D-ASR), which explicitly models source speaker locations. In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of D-ASR: localization, separation… 
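As a rough illustration of why the azimuth angle matters for separation, the sketch below builds a far-field steering vector from an angle and measures how strongly a simple delay-and-sum beamformer responds to sources at other angles. This is not the paper's network: the circular array geometry, the speed of sound, the sign convention, and all function names (`mic_positions`, `steering_vector`, `beam_response`) are assumptions made for this example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def mic_positions(num_mics, radius_m):
    """Hypothetical uniform circular array in the horizontal plane."""
    return [
        (radius_m * math.cos(2 * math.pi * m / num_mics),
         radius_m * math.sin(2 * math.pi * m / num_mics))
        for m in range(num_mics)
    ]


def steering_vector(azimuth_rad, freq_hz, positions):
    """Plane-wave steering vector for a source at the given azimuth.

    Each entry is the phase term for one microphone, relative to the
    array centre, under the (assumed) convention exp(+j*2*pi*f*tau).
    """
    ux, uy = math.cos(azimuth_rad), math.sin(azimuth_rad)
    vec = []
    for x, y in positions:
        # Time advance of this microphone relative to the array centre.
        tau = (x * ux + y * uy) / SPEED_OF_SOUND
        vec.append(cmath.exp(2j * math.pi * freq_hz * tau))
    return vec


def beam_response(look_rad, source_rad, freq_hz, positions):
    """Magnitude response of a delay-and-sum beamformer steered to
    look_rad for a plane wave arriving from source_rad (1.0 = no loss)."""
    w = steering_vector(look_rad, freq_hz, positions)
    a = steering_vector(source_rad, freq_hz, positions)
    total = sum(wi.conjugate() * ai for wi, ai in zip(w, a))
    return abs(total) / len(positions)


if __name__ == "__main__":
    pos = mic_positions(num_mics=4, radius_m=0.05)
    for deg in (0, 30, 60, 90):
        r = beam_response(0.0, math.radians(deg), 2000.0, pos)
        print(f"source at {deg:3d} deg: response {r:.3f}")
```

When the look direction matches the source azimuth the response is 1; a source at a different azimuth is attenuated, which is the intuition behind an angle estimate driving separation quality.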

Multi-Channel Multi-Speaker ASR Using 3D Spatial Feature
Experimental results show that the proposed ALL-In-One model achieved an error rate comparable to the pipelined system while halving the inference time, and that the proposed 3D spatial feature significantly outperformed all previous work using 1D directional information in both paradigms.
Signal-Aware Direction-of-Arrival Estimation Using Attention Mechanisms
Exploring Multi-Channel Features for Speaker Verification with Joint VAD and Speech Enhancement
The improvements from speaker-dependent directional features are more consistent in noisy conditions than in clean ones, and the learned multi-channel speaker embedding space can be made more discriminative through contrastive-loss-based fine-tuning.
The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
The recent development of ESPnet is described, an end-to-end speech processing toolkit that includes text to speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation.
FAST-RIR: Fast neural diffuse room impulse response generator
A neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment that is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments.
A novel normalization algorithm to facilitate pre-assessment of Covid-19 disease by improving accuracy of CNN and its FPGA implementation
Covid-19 infections can be easily diagnosed with the MVSR normalization technique, which increased the classification accuracy of the CNN model from 83.01% to 96.16% for binary classification of chest X-ray images.


Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system
This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly
Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals
The proposed convolutional neural network based supervised learning method for estimating the direction of arrival (DOA) of multiple speakers is shown to adapt to unseen acoustic conditions and to be robust to unseen noise types.
End-to-End Multi-Speaker Speech Recognition
This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals that enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.
A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation
This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives
  • A. Subramanian, Chao Weng, Dong Yu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper proposes a method to jointly optimize a location guided target speech extraction module along with a speech recognition module only with ASR error minimization criteria and designs a system that can take both location and anchor speech as input at the same time.
MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition
A novel neural sequence-to-sequence (seq2seq) architecture is proposed, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.
End-To-End Multi-Speaker Speech Recognition With Transformer
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network
This work proposes a simple yet effective method for multi-channel far-field overlapped speech recognition that achieves more than 24% relative word error rate (WER) reduction compared with fixed beamforming with oracle selection.
Multichannel End-to-end Speech Recognition
The end-to-end framework for speech recognition is extended to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network, allowing the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective.
Speech Enhancement Using End-to-End Speech Recognition Objectives
This paper uses a recently developed multichannel end-to-end (ME2E) system, which integrates neural dereverberation, beamforming, and attention-based speech recognition within a single neural network, and investigates how a system optimized based on the ASR objective improves the speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric.