• Corpus ID: 238856799

Auxiliary Loss of Transformer with Residual Connection for End-to-End Speaker Diarization

  • Yechan Yu, Dongkeon Park, Hong Kook Kim
  • Published 14 October 2021
  • Computer Science, Engineering
  • ArXiv
End-to-end neural diarization (EEND) with self-attention directly predicts speaker labels from inputs and can handle overlapped speech. Although EEND outperforms clustering-based speaker diarization (SD), it cannot be improved further simply by stacking more encoder blocks, because the last encoder block receives dominant supervision compared with the lower blocks. This paper proposes a new residual auxiliary EEND (RX-EEND) learning architecture for transformers to enforce the… 
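The core idea the abstract describes, supervising every encoder block instead of only the last one, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: `rx_eend_style_loss` and `aux_weight` are hypothetical names, and each per-block loss is assumed to already be computed.

```python
def rx_eend_style_loss(block_losses, aux_weight=0.5):
    """Toy sketch (not the paper's exact formulation): combine the final
    encoder block's diarization loss with an auxiliary average of the lower
    blocks' losses, so supervision reaches every block instead of being
    dominated by the last one.

    block_losses: per-block losses, ordered from the lowest block to the top.
    """
    final = block_losses[-1]
    # Average the lower blocks' losses; guard against a single-block model.
    aux = sum(block_losses[:-1]) / max(len(block_losses) - 1, 1)
    return final + aux_weight * aux
```

For example, with per-block losses `[1.0, 2.0, 3.0]` and `aux_weight=0.5`, the combined loss is `3.0 + 0.5 * 1.5 = 3.75`.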

End-to-End Neural Speaker Diarization with Self-Attention
The experimental results revealed that self-attention was the key to achieving good performance and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even outperformed the state-of-the-art x-vector clustering-based method.
End-to-end Neural Diarization: From Transformer to Conformer
By mixing simulated and real data in EEND training, this work mitigates the mismatch between simulated data and real speaker behavior in terms of temporal statistics reflecting turn-taking between speakers, and investigates its correlation with diarization error.
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
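The permutation-free objective named in this entry can be sketched in a few lines: since speaker labels have no inherent order, the loss is evaluated under every speaker permutation of the references and the minimum is kept. Names here (`permutation_free_loss`, `frame_loss`) are illustrative, not the paper's API.

```python
import itertools

def permutation_free_loss(pred, ref, frame_loss):
    """Permutation-free (permutation-invariant) objective: score the
    predictions against every speaker permutation of the reference labels
    and return the minimum total loss.

    pred, ref: lists of per-speaker label sequences (e.g. lists of 0/1).
    frame_loss: callable scoring one predicted sequence against one reference.
    """
    n = len(ref)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(frame_loss(pred[s], ref[perm[s]]) for s in range(n))
        best = min(best, total)
    return best
```

With an absolute-difference frame loss, swapping the two reference speakers relative to the predictions still yields zero loss, which is exactly the invariance the objective provides.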
End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
An end-to-end deep network model that performs meeting diarization from single-channel audio recordings, designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy-based loss functions.
Speaker diarization using deep neural network embeddings
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Generalized End-to-End Loss for Speaker Verification
A new loss function called generalized end-to-end (GE2E) loss is proposed, which makes the training of speaker verification models more efficient than the previous tuple-based end-to-end (TE2E) loss function.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Speaker Diarization with LSTM
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
Speaker Diarization: A Review of Recent Research
An analysis of speaker diarization performance, as reported through the NIST Rich Transcription evaluations on meeting data, is presented, and important areas for future research are identified.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it also generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
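The mechanism at the heart of the Transformer, and of the self-attention EEND models above, is scaled dot-product attention: softmax(QKᵀ/√d)V. A minimal pure-Python sketch over nested lists (illustrative, not a library implementation):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over plain nested lists.

    Q: queries, K: keys, V: values; each a list of equal-length vectors.
    Returns one output row per query: a softmax-weighted mix of the values.
    """
    d = len(Q[0])  # key/query dimension used for the 1/sqrt(d) scaling
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query aligned with the first key attends mostly to the first value; since the attention weights sum to 1, each output row is a convex combination of the value vectors.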