Corpus ID: 235755369

A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

@article{Kanda2021ACS,
  title={A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio},
  author={Naoyuki Kanda and Xiong Xiao and Jian Wu and Tianyan Zhou and Yashesh Gaur and Xiaofei Wang and Zhong Meng and Zhuo Chen and Takuya Yoshioka},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.02852}
}
Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize “who spoke what” from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In this paper, we present our recent study on the comparison of such modular and joint approaches…

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR
TLDR
Experimental results on the LibriCSS and AMI corpora show that the proposed method achieves a significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown, and achieves performance comparable to TS-VAD when the number of speakers is given in advance.
Separating Long-Form Speech with Group-Wise Permutation Invariant Training
  • Wangyou Zhang, Zhuo Chen, +8 authors Furu Wei
  • Engineering, Computer Science
  • ArXiv
  • 2021
TLDR
A novel training scheme named Group-PIT is proposed, which allows direct training of speech separation models on long-form speech with a low computational cost for label assignment. Experiments demonstrate the effectiveness of the proposed approaches, especially in dealing with very long speech input.

References

Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings
TLDR
This paper performs speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model to diarize the utterances of the speakers whose profiles are missing from the speaker inventory.
Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis
TLDR
Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module, and an end-to-end modular system for the LibriCSS meeting data is proposed.
Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings
TLDR
Experiments using the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
End-to-end Monaural Multi-speaker ASR System without Pretraining
TLDR
The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams, leading to ∼10.0% relative performance gains in terms of CER and WER, respectively.
End-to-End Speaker-Attributed ASR with Transformer
TLDR
This paper thoroughly updates the model architecture, previously designed around a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures, and proposes a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions.
Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR
  • Naoyuki Kanda, Zhong Meng, +4 authors T. Yoshioka
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
A speaker-attributed minimum Bayes risk (SA-MBR) training method where the parameters are trained to directly minimize the expected SA-WER over the training data.
Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models
TLDR
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings and proposes an iterative method in which the estimation of speaker embeddings and TS-ASR based on the estimated speaker embeddings are alternately executed.
Joint Speech Recognition and Speaker Diarization via Sequence Transduction
TLDR
This work proposes a novel approach to tackle the two tasks by a joint ASR and SD system using a recurrent neural network transducer that utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which only use acoustic cues.
A Purely End-to-End System for Multi-speaker Speech Recognition
TLDR
Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective.
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
TLDR
The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during training to regularize the network to achieve a better representation for speaker separation, thus achieving better accuracy on the target-speaker ASR.