Corpus ID: 237563140

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

  • Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li
  • Published 17 September 2021
  • Computer Science, Engineering
  • ArXiv
Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we propose the dual-path (DP) modeling strategy first used for time-domain speech separation. We…
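The abstract above borrows the dual-path strategy from time-domain speech separation: a long sequence is segmented into short overlapping chunks, then processed alternately along the intra-chunk (local) and inter-chunk (global) axes. As a rough illustration only, and not the paper's actual SURT implementation, here is a minimal NumPy sketch in which a causal cumulative mean stands in for the learned RNN/attention layers:

```python
import numpy as np

def segment(x, chunk_len, hop):
    """Split a (T, D) sequence into overlapping chunks -> (N, chunk_len, D)."""
    T, D = x.shape
    n = 1 + max(0, (T - chunk_len + hop - 1) // hop)
    pad = (n - 1) * hop + chunk_len - T
    x = np.pad(x, ((0, pad), (0, 0)))
    return np.stack([x[i * hop : i * hop + chunk_len] for i in range(n)])

def dual_path_block(chunks, intra_fn, inter_fn):
    """Apply intra_fn within each chunk (local), then inter_fn across chunks (global)."""
    n, k, d = chunks.shape
    # intra-chunk pass: process each chunk independently along its own time axis
    intra = np.stack([intra_fn(c) for c in chunks])                      # (N, K, D)
    # inter-chunk pass: process each within-chunk position across all chunks
    inter = np.stack([inter_fn(intra[:, j]) for j in range(k)], axis=1)  # (N, K, D)
    return inter

# toy "layer": a causal cumulative mean, standing in for an RNN/attention layer
smooth = lambda seq: np.cumsum(seq, axis=0) / np.arange(1, len(seq) + 1)[:, None]

x = np.arange(24, dtype=float).reshape(12, 2)    # 12 frames, 2 features
chunks = segment(x, chunk_len=4, hop=2)          # 50% overlap
y = dual_path_block(chunks, smooth, smooth)
print(chunks.shape, y.shape)                     # (5, 4, 2) (5, 4, 2)
```

Because the chunk length can be made much shorter than the full utterance, each pass touches only short sequences, which is what makes the dual-path structure attractive for streaming multi-turn input.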
1 Citation

Figures and Tables from this paper

Recent Advances in End-to-End Automatic Speech Recognition
  • Jinyu Li
  • Computer Science, Engineering
  • ArXiv
  • 2021
This paper overviews the recent advances in E2E models, focusing on technologies addressing those challenges from the industry’s perspective.


References

Streaming End-to-End Multi-Talker Speech Recognition
This work proposes the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition, and investigates the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach to train this model.
End-To-End Multi-Talker Overlapping Speech Recognition
  • Anshuman Tripathi, Han Lu, H. Sak
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
In this paper we present an end-to-end speech recognition system that can recognize single-channel speech where multiple talkers can speak at the same time (overlapping speech) by using a neural…
End-to-end Monaural Multi-speaker ASR System without Pretraining
The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating overlapping speech and recognizing the separated streams, leading to ∼10.0% relative performance gains in terms of CER and WER respectively.
End-to-End Speaker-Attributed ASR with Transformer
This paper thoroughly updates the model architecture, previously designed as a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures, and proposes a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Improving RNN Transducer Modeling for End-to-End Speech Recognition
This paper optimizes the training algorithm of RNN-T to reduce memory consumption, allowing larger training minibatches for faster training, and proposes better model structures so that RNN-T models with very good accuracy but a small footprint are obtained.
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
End-to-End Multi-Speaker Speech Recognition
This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals that enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.
Meeting Transcription Using Asynchronous Distant Microphones
A system that generates speaker-annotated transcripts of meetings by using multiple asynchronous distant microphones using continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization, and system combination is described.
Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
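The SURT reference above contrasts Permutation Invariant Training (PIT), which searches all output-to-reference assignments, with Heuristic Error Assignment Training (HEAT), which fixes the assignment by a heuristic such as utterance start time. As a rough sketch of the two assignment schemes only (with a toy token-mismatch loss standing in for the actual transducer loss):

```python
from itertools import permutations

def pair_loss(hyp, ref):
    """Toy pairwise loss: token mismatches plus length difference."""
    return sum(h != r for h, r in zip(hyp, ref)) + abs(len(hyp) - len(ref))

def pit_loss(hyps, refs):
    """PIT: minimum total loss over all output-to-reference permutations."""
    return min(
        sum(pair_loss(h, r) for h, r in zip(hyps, perm))
        for perm in permutations(refs)
    )

def heat_loss(hyps, refs_with_start):
    """HEAT: assign references to output channels by start time,
    with no permutation search."""
    ordered = [r for r, _ in sorted(refs_with_start, key=lambda p: p[1])]
    return sum(pair_loss(h, r) for h, r in zip(hyps, ordered))

hyps = [["hello", "world"], ["good", "morning"]]
refs = [["good", "morning"], ["hello", "world"]]
print(pit_loss(hyps, refs))                                # 0 (best permutation)
print(heat_loss(hyps, [(refs[0], 1.2), (refs[1], 0.3)]))   # 0 (start-time order)
```

The trade-off the sketch makes visible: PIT's permutation search grows factorially with the number of output channels, whereas HEAT's fixed heuristic assignment stays linear, which is one reason the SURT line of work favors HEAT for streaming multi-talker recognition.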