Online End-To-End Neural Diarization with Speaker-Tracing Buffer

Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu
2021 IEEE Spoken Language Technology Workshop (SLT)
This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker permutation problem, because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer…
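As a rough illustration of the buffer idea described above, the following Python sketch keeps a fixed number of past frames and uses them to re-align the speaker order of a new chunk. It is not the authors' implementation: the abstract is truncated here, so the function names `update_buffer` and `align_permutation`, the two-speaker selection score `|p1 - p2|`, and the product-sum matching criterion are all assumptions made for illustration only.

```python
from itertools import permutations

import numpy as np


def align_permutation(stored_probs, new_probs):
    """Return the column order of new_probs that best matches stored_probs.

    Both arrays have shape (frames, speakers) and cover the same buffered
    frames. The product-sum matching criterion is an assumption.
    """
    n_spk = stored_probs.shape[1]
    best = max(
        permutations(range(n_spk)),
        key=lambda p: float(np.sum(stored_probs * new_probs[:, list(p)])),
    )
    return list(best)


def update_buffer(feats, probs, buffer_size):
    """Keep the buffer_size frames whose speaker probabilities differ most.

    Assumed selection score (two speakers only): |p1 - p2|, so frames
    dominated by a single speaker are preferred. Kept frames stay in
    their original time order.
    """
    score = np.abs(probs[:, 0] - probs[:, 1])
    keep = np.sort(np.argsort(-score, kind="stable")[:buffer_size])
    return feats[keep], probs[keep]
```

In this sketch, each new chunk would be processed together with the buffered frames, `align_permutation` would be applied to the outputs on the buffered frames to reorder the chunk's speaker columns, and `update_buffer` would then refresh the buffer from the concatenated frames.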
Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
An online end-to-end diarization method that can handle overlapping speech and flexible numbers of speakers is proposed; it achieves performance comparable to the offline EEND method and shows better performance on the DIHARD II dataset.
Block-Online Guided Source Separation
Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm but with 32x faster calculation, which is sufficient for real-time applications.
Speaker recognition based on deep learning: An overview
Several major subtasks of speaker recognition are reviewed, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods.
A Review of Speaker Diarization: Recent Advances with Deep Learning
This paper discusses how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other.
Configurable Privacy-Preserving Automatic Speech Recognition
It is shown that voice privacy can be configurable, and it is argued this presents new opportunities for privacy-preserving applications incorporating ASR.
DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding
This work introduces DIVE, an end-to-end speaker diarization algorithm that does not rely on pretrained speaker representations and optimizes all parameters of the system with a multi-speaker voice activity loss.
End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection
This paper proposes a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency, and outperforms conventional EEND systems in terms of diarization error rate.
Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds
A simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers by modifying the conventional EEND framework to output global speaker embeddings so that speaker clustering can be performed across blocks based on a constrained clustering algorithm to solve the permutation problem.
Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference
This study proposes to endow the diarization model with discriminability and to rectify less-reliable online inference with guidance, tackling the challenge from two perspectives and building on the current prior art, UIS-RNN.
Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization
This paper proposes an iterative pseudo-label method for EEND, which trains the model using unlabeled data of a target condition, and proposes a committee-based training method to improve the performance of EEND.


End-to-End Neural Speaker Diarization with Permutation-Free Objectives
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
End-to-End Neural Speaker Diarization with Self-Attention
The experimental results revealed that self-attention was the key to achieving good performance and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even better than the state-of-the-art x-vector clustering-based method.
End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification
The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps, and the self-attention-based neural network was the key to achieving excellent performance.
Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data
  • Enrico Fini, A. Brutti
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Qualitative modifications to the UIS-RNN model are proposed that significantly improve the learning efficiency and the overall diarization performance, and a novel loss function is introduced, called Sample Mean Loss, which presents a better modelling of the speaker turn behaviour.
Speaker diarization using deep neural network embeddings
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Fully Supervised Speaker Diarization
A fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN), is proposed; given extracted speaker-discriminative embeddings, it decodes in an online fashion, whereas most state-of-the-art systems rely on offline clustering.
Speaker Diarization with LSTM
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
Low-latency speaker spotting with online diarization and detection
Although the reliability of online diarization and detection still needs to improve, the proposed LLSS framework provides a vehicle to fuel future research in both areas and embraces a reproducible research policy.
Generalized End-to-End Loss for Speaker Verification
A new loss function called generalized end-to-end (GE2E) loss is proposed, which makes the training of speaker verification models more efficient than the previous tuple-based end-to-end (TE2E) loss function.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
This paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.