Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization

Ruiqing Yin, Hervé Bredin, Claude Barras
Speaker diarization is the task of determining "who speaks when" in an audio stream. Most diarization systems rely on statistical models to address four sub-tasks: speech activity detection (SAD), speaker change detection (SCD), speech turn clustering, and re-segmentation. First, following the recent success of recurrent neural networks (RNNs) for SAD and SCD, we propose to address re-segmentation with Long Short-Term Memory (LSTM) networks. Then, we propose to use affinity propagation on top of…
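
The abstract names affinity propagation as the clustering step but is cut off before saying what it operates on. As a rough illustration only, the sketch below runs textbook affinity propagation (responsibility and availability message passing with damping) on a toy similarity matrix. The 2-D "embeddings", the negative squared Euclidean similarity, the median preference, and the function name `affinity_propagation` are all assumptions made for this example, not the authors' configuration.

```python
from statistics import median


def affinity_propagation(S, damping=0.7, n_iter=300):
    """Cluster n items from an n x n similarity matrix S (list of lists).

    S[k][k] holds the "preference" of item k to become an exemplar.
    Returns (exemplars, labels); labels[i] is the exemplar index of item i,
    or -1 for every item if no exemplar emerged.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(n_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)

        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive support from all i' != k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    # Item k is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [-1] * n
    labels = [
        i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
        for i in range(n)
    ]
    return exemplars, labels


# Toy 2-D vectors standing in for six speech turns from two hypothetical speakers.
turns = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
n = len(turns)

# Assumed similarity: negative squared Euclidean distance.
S = [
    [-sum((a - b) ** 2 for a, b in zip(turns[i], turns[j])) for j in range(n)]
    for i in range(n)
]

# Assumed preference: median of the off-diagonal similarities.
preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    S[k][k] = preference

exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

Unlike k-means or agglomerative clustering with a fixed stopping point, affinity propagation does not take the number of clusters as input; the preference value on the diagonal controls how many exemplars emerge, which is why it is exposed here rather than hard-coded inside the function.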


Steps towards end-to-end neural speaker diarization. (Étapes vers un système neuronal de bout en bout pour la tâche de segmentation et de regroupement en locuteurs)

This thesis proposes improving the similarity matrix with a bidirectional LSTM and then applying spectral clustering on top of the improved matrix, which achieves state-of-the-art performance on the CALLHOME telephone conversation dataset.

Efficient speaker diarization and low-latency speaker spotting. (Segmentation et regroupement efficaces en locuteurs et détection des locuteurs à faible latence)

An approach to speaker modelling involving binary keys (BKs) is exploited, which leads to substantial improvements over baseline techniques and to excellent results in 3 internationally competitive evaluations, including 2 best-ranked systems.

Speech Recognition and Multi-Speaker Diarization of Long Conversations

To handle long conversations with unknown utterance boundaries, a striding attention decoding algorithm and data augmentation techniques are introduced which, combined with model pre-training, improve ASR and SD.

LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

A supervised method that computes the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM); it significantly outperforms the state-of-the-art methods and achieves a diarization error rate below average.

Efficient speaker diarization and low-latency speaker spotting

The new task, coined low-latency speaker spotting (LLSS), involves the rapid detection of known speakers within multi-speaker audio streams and requires re-thinking online diarization and the manner in which diarization and detection sub-systems should best be combined.

Compositional Embedding Models for Speaker Identification and Diarization with Simultaneous Speech From 2+ Speakers

  • Zeqian Li, J. Whitehill
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A new method for speaker diarization that can handle overlapping speech with 2+ people and outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets).

Speech Enhancement for Multimodal Speaker Diarization System

The performance of the proposed multimodal speaker diarization system under noisy conditions is presented. The performance improvement of the LSTM model is comparable with that of the Wiener filter, while in the case of realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).

Enhancements for Audio-only Diarization Systems

Two approaches are presented to enhance speaker clustering, the most challenging component of a speaker diarization system: a temporal smoothing process combined with nonlinear filtering, and improvements to the Deep Embedded Clustering algorithm.

Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection

A neural Long Short-Term Memory-based architecture for overlap detection is detailed, which achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora and reveals promising directions for handling overlap.

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

A novel multimodal speaker diarization technique that finds the active speaker using an audio-visual synchronization model, a two-stream network that matches audio frames with their respective visual input segments.



Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization

This paper addresses the issue of low-latency speaker diarization, which consists of continuously detecting new or reoccurring speakers within an audio stream and determining when each speaker is active with a low latency (e.g. every second).

Speaker Diarization with LSTM

This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.

Speaker Diarization: A Review of Recent Research

An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data is presented, and important areas for future research are identified.

Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks

The result shows that the proposed model brings good improvement over conventional methods based on BIC and Gaussian Divergence.

Improving Speaker Diarization

The improved LIMSI speaker diarization system used in the RT-04F evaluation reduces the speaker error time by over 75% on the development data, compared to the best configuration baseline system for this task.

Automatic Segmentation, Classification and Clustering of Broadcast News Audio

This work describes the problems faced in adapting a system built to recognize one utterance at a time to a task that requires recognition of an entire half hour show, and shows that a priori knowledge of acoustic conditions and speakers in the broadcast data is not required for segmentation.

TristouNet: Triplet loss for speaker turn embedding

  • H. Bredin
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
Experiments on short speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.

An overview of automatic speaker diarization systems

An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.

A novel speaker clustering algorithm via supervised affinity propagation

A modified method is proposed that automatically reruns the AP procedure until the final number of clusters converges to the specified number; compared with AP, it yields a noticeable improvement in speaker purity with a slight decrease in cluster purity.

S4D: Speaker Diarization Toolkit in Python

S4D provides various state-of-the-art components and the possibility to easily develop end-to-end diarization prototype systems and is an extension of the open-source toolkit for speaker recognition: SIDEKIT.