Corpus ID: 235755369

A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Speaker-attributed automatic speech recognition (SA-ASR) is the task of recognizing "who spoke what" in multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization, and ASR. More recently, to enable joint optimization, an end-to-end (E2E) SA-ASR model has been proposed with promising results on simulated data. In this paper, we present our recent comparative study of such modular and joint approaches… 


Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR
Experimental results on the LibriCSS and AMI corpora show that the proposed method achieves a significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown, and comparable performance to TS-VAD when the number of speakers is given in advance.
Multi-turn RNN-T for streaming recognition of multi-party speech
This work proposes a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture, and addresses several challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
Separating Long-Form Speech with Group-Wise Permutation Invariant Training
A novel training scheme named Group-PIT is proposed, which allows direct training of speech separation models on long-form speech with a low computational cost for label assignment; experiments demonstrate the effectiveness of the proposed approach, especially for very long speech inputs.
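The label-assignment cost that Group-PIT reduces comes from standard permutation invariant training (PIT), which evaluates the loss under every speaker-to-output permutation and keeps the minimum. A minimal illustrative sketch of that basic idea (the function name and scalar "signals" are hypothetical, not the paper's implementation):

```python
import itertools

def pit_loss(estimates, references, pairwise_loss):
    """Permutation invariant training (PIT): compute the total loss under
    every assignment of estimated outputs to reference speakers and return
    the minimum. `estimates` and `references` have one entry per speaker."""
    best = None
    for perm in itertools.permutations(range(len(references))):
        total = sum(pairwise_loss(estimates[i], references[p])
                    for i, p in enumerate(perm))
        if best is None or total < best:
            best = total
    return best

# Toy example with scalar signals and squared error: the swapped
# assignment matches perfectly, so the PIT loss is zero.
mse = lambda a, b: (a - b) ** 2
print(pit_loss([1.0, 5.0], [5.0, 1.0], mse))  # 0.0
```

Because the number of permutations grows factorially with the number of speakers, schemes such as Group-PIT that constrain the assignment become important for long-form, multi-utterance input.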


Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings
This paper performs speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model to diarize the utterances of the speakers whose profiles are missing from the speaker inventory.
Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis
This work proposes an end-to-end modular system for the LibriCSS meeting data; experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated by a well-trained separation module.
Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings
Experiments using LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER especially for long-form multi-talker recordings.
End-to-end Monaural Multi-speaker ASR System without Pretraining
The experiments demonstrate that the proposed methods improve the end-to-end model's ability to separate overlapping speech and recognize the separated streams, leading to ∼10.0% relative performance gains in terms of CER and WER.
End-to-End Speaker-Attributed ASR with Transformer
This paper thoroughly updates the model architecture, previously designed around a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures, and proposes a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions.
Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR
  • Naoyuki Kanda, Zhong Meng, +4 authors T. Yoshioka
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This work proposes a speaker-attributed minimum Bayes risk (SA-MBR) training method in which the model parameters are trained to directly minimize the expected SA-WER over the training data.
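The MBR idea summarized above can be sketched in illustrative notation (the symbols are generic, not necessarily the paper's own): instead of cross-entropy, training minimizes the expected error over a hypothesis list,

```latex
\mathcal{L}_{\mathrm{SA\text{-}MBR}}
  = \sum_{h \in \mathcal{H}} P(h \mid X)\;
    \mathrm{SA\text{-}WER}\!\left(h, h^{\mathrm{ref}}\right)
```

where $\mathcal{H}$ is an n-best list decoded from the E2E SA-ASR model for input $X$, $P(h \mid X)$ is the (normalized) model posterior of hypothesis $h$, and $h^{\mathrm{ref}}$ is the reference. Minimizing this expectation by gradient descent aligns the training objective with the evaluation metric.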
Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings, and proposes an iterative method in which the estimation of speaker embeddings and TS-ASR based on the estimated speaker embeddings are alternately executed.
Joint Speech Recognition and Speaker Diarization via Sequence Transduction
This work proposes a novel approach to tackle the two tasks by a joint ASR and SD system using a recurrent neural network transducer that utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which only use acoustic cues.
A Purely End-to-End System for Multi-speaker Speech Recognition
Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective.
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
The proposed auxiliary loss function additionally maximizes interference-speaker ASR accuracy during training, regularizing the network toward a better representation for speaker separation and thus better accuracy on the target-speaker ASR task.
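The combination described above is a standard multi-task arrangement, which can be sketched in illustrative notation (generic symbols, not the paper's exact formulation):

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{ASR}}^{\mathrm{tgt}}
            + \lambda\, \mathcal{L}_{\mathrm{ASR}}^{\mathrm{int}}
```

where $\mathcal{L}_{\mathrm{ASR}}^{\mathrm{tgt}}$ is the usual target-speaker ASR loss, $\mathcal{L}_{\mathrm{ASR}}^{\mathrm{int}}$ is the auxiliary loss on the interference speaker's transcript, and $\lambda$ is a weighting hyperparameter. The auxiliary branch is used only during training; at inference time, only the target-speaker output is decoded.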