Corpus ID: 237571714

Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition

@article{Weninger2021DualEncoderAW,
  title={Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition},
  author={Felix Weninger and Marco Gaudesi and Ralf Leibold and Roberto Gemello and Puming Zhan},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.08744}
}
In this paper, we propose a dual-encoder ASR architecture for joint modeling of close-talk (CT) and far-talk (FT) speech, in order to combine the advantages of CT and FT devices for better accuracy. The key idea is to add an encoder selection network to choose the optimal input source (CT or FT) and the corresponding encoder. We use a single-channel encoder for CT speech and a multi-channel encoder with Spatial Filtering neural beamforming for FT speech, which are jointly trained with the…
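As a rough illustration of the idea described in the abstract, below is a minimal sketch of a dual-encoder layout with an encoder selection network, assuming PyTorch. The layer types and sizes, the mean-pooled selector input, and the soft per-utterance weighting are illustrative assumptions and not the paper's exact configuration; in particular, the FT branch's Spatial Filtering neural beamforming is stubbed out by a simple linear projection across channels.

```python
import torch
import torch.nn as nn

class DualEncoderASR(nn.Module):
    """Sketch: two encoders (CT and FT) combined via an encoder selection network."""

    def __init__(self, feat_dim=80, n_ft_channels=4, enc_dim=256, vocab_size=1000):
        super().__init__()
        # Single-channel encoder for close-talk (CT) features.
        self.ct_encoder = nn.LSTM(feat_dim, enc_dim, num_layers=3, batch_first=True)
        # Multi-channel encoder for far-talk (FT) features; a linear projection
        # across channels stands in for neural beamforming in this sketch.
        self.ft_frontend = nn.Linear(feat_dim * n_ft_channels, feat_dim)
        self.ft_encoder = nn.LSTM(feat_dim, enc_dim, num_layers=3, batch_first=True)
        # Encoder selection network: predicts a weight per encoder from
        # utterance-level (mean-pooled) encoder outputs.
        self.selector = nn.Sequential(nn.Linear(2 * enc_dim, 2), nn.Softmax(dim=-1))
        self.output = nn.Linear(enc_dim, vocab_size)

    def forward(self, ct_feats, ft_feats):
        # ct_feats: (B, T, feat_dim); ft_feats: (B, T, n_ft_channels * feat_dim)
        h_ct, _ = self.ct_encoder(ct_feats)
        h_ft, _ = self.ft_encoder(self.ft_frontend(ft_feats))
        pooled = torch.cat([h_ct.mean(dim=1), h_ft.mean(dim=1)], dim=-1)
        w = self.selector(pooled)  # (B, 2) soft selection weights
        h = w[:, 0:1, None] * h_ct + w[:, 1:2, None] * h_ft
        return self.output(h)      # frame-level logits over the vocabulary

# Usage example with random features:
# model = DualEncoderASR()
# logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 4 * 80))  # (2, 100, 1000)
```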


References

Showing 1-10 of 41 references
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
  • Qian Zhang, Han Lu, +4 authors Shankar Kumar
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats the state-of-the-art accuracy on the LibriSpeech benchmarks.
End-To-End Multi-Speaker Speech Recognition With Transformer
TLDR
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
End-to-End Multi-Channel Transformer for Speech Recognition
TLDR
This paper applies neural Transformer architectures to multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers.
Exploring End-to-End Multi-Channel ASR with Bias Information for Meeting Transcription
TLDR
This work investigates the joint modeling of a mask-based beamformer and attention-encoder-decoder-based ASR and proposes an effective location bias integration method called deep concatenation for the beamformer network, which achieves a substantial word error rate reduction.
A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition
TLDR
The purpose of this paper is to quantify and characterize the performance gap between the close-talking and distant-speech domains, setting up the basis for studying adaptation of speech recognizers from close-talking to distant speech.
Multi-geometry Spatial Acoustic Modeling for Distant Speech Recognition
TLDR
This work proposes a unified acoustic model framework that optimizes spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input, and demonstrates the effectiveness of such MC neural networks through ASR experiments on real-world far-field data.
Multi-Stream End-to-End Speech Recognition
TLDR
A multi-stream framework based on joint CTC/attention end-to-end ASR is presented, with parallel streams represented by separate encoders aiming to capture diverse information, yielding relative word error rate (WER) reductions.
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
TLDR
New acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based directly on an ASR criterion are developed and incorporated into the acoustic model.
A Practical Two-Stage Training Strategy for Multi-Stream End-to-End Speech Recognition
TLDR
This work proposes a practical two-stage training scheme that trains only the attention fusion module in the second stage, using the UFE features and pretrained components from Stage 1; it achieves relative word error rate reductions of 8.2-32.4% while consistently outperforming several conventional combination methods.
Multi-channel Attention for End-to-End Speech Recognition
TLDR
This work proposes a sensory attention mechanism that is invariant to the channel ordering and increases the overall parameter count by only 0.09%, and demonstrates that even without re-training, this attention-equipped end-to-end model is able to deal with arbitrary numbers of input channels during inference.
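To make the channel-fusion idea in the last entry above (Multi-channel Attention for End-to-End Speech Recognition) concrete, here is a minimal sketch of attention over microphone channels that is invariant to channel ordering, assuming PyTorch; the shared scoring layer and softmax pooling are illustrative assumptions rather than the cited paper's exact model.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch: fuses an arbitrary number of microphone channels into one stream."""

    def __init__(self, feat_dim=80):
        super().__init__()
        # The same scorer is applied to every channel, so the fusion does not
        # depend on the order in which channels are presented.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):
        # x: (batch, channels, time, feat_dim); channel order is arbitrary.
        w = torch.softmax(self.score(x), dim=1)  # attention weights over channels
        return (w * x).sum(dim=1)                # (batch, time, feat_dim) fused stream

# Usage example: fuse 6 channels of 80-dim features into a single stream.
# fused = ChannelAttentionFusion()(torch.randn(2, 6, 100, 80))  # (2, 100, 80)
```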