Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization

@inproceedings{Yin2018NeuralST,
  title={Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization},
  author={Ruiqing Yin and Herv{\'e} Bredin and Claude Barras},
  booktitle={INTERSPEECH},
  year={2018}
}
Speaker diarization is the task of determining "who speaks when" in an audio stream. Most diarization systems rely on statistical models to address four sub-tasks: speech activity detection (SAD), speaker change detection (SCD), speech turn clustering, and re-segmentation. First, following the recent success of recurrent neural networks (RNNs) for SAD and SCD, we propose to address re-segmentation with long short-term memory (LSTM) networks. Then, we propose to use affinity propagation on top of… 
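
To make the clustering step concrete, here is a minimal sketch, assuming scikit-learn and placeholder random vectors in place of the neural speaker embeddings, of affinity propagation applied to speech-turn embeddings; it illustrates the technique, not the authors' implementation.

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder: one embedding per detected speech turn. In the paper these come
# from a neural speaker-embedding network; here they are random vectors.
rng = np.random.default_rng(0)
turn_embeddings = rng.normal(size=(20, 128))

# Affinity propagation operates on a similarity matrix; cosine similarity is a
# common choice for speaker embeddings.
similarity = cosine_similarity(turn_embeddings)

# The preference controls how many exemplars (speakers) emerge; the median
# similarity is the usual default.
ap = AffinityPropagation(affinity="precomputed",
                         preference=np.median(similarity),
                         random_state=0)
speaker_labels = ap.fit_predict(similarity)  # one speaker label per speech turn
print(speaker_labels)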

Citations

Steps towards end-to-end neural speaker diarization. (Étapes vers un système neuronal de bout en bout pour la tâche de segmentation et de regroupement en locuteurs)

TLDR
This thesis proposes to improve the similarity matrix with a bidirectional LSTM and then apply spectral clustering on top of the improved similarity matrix, achieving state-of-the-art performance on the CALLHOME telephone conversation dataset.

Efficient speaker diarization and low-latency speaker spotting. (Segmentation et regroupement efficaces en locuteurs et détection des locuteurs à faible latence)

TLDR
An approach to speaker modelling involving binary keys (BKs) is exploited, which leads to substantial improvements over baseline techniques and produced excellent results in 3 internationally competitive evaluations, including 2 best-ranked systems.

Speech Recognition and Multi-Speaker Diarization of Long Conversations

TLDR
To handle long conversations with unknown utterance boundaries, a striding attention decoding algorithm and data augmentation techniques are proposed which, combined with model pre-training, improve ASR and SD.

LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

TLDR
A supervised method is proposed that measures the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM); it significantly outperforms state-of-the-art methods and achieves a below-average diarization error rate.
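
As a hedged illustration of this clustering stage, the sketch below applies scikit-learn's spectral clustering to a precomputed segment-similarity matrix; the matrix here is plain cosine similarity of placeholder embeddings, standing in for the Bi-LSTM-refined similarity described in the paper.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder per-segment embeddings (the paper refines the similarity matrix
# with a Bi-LSTM before clustering; that step is omitted here).
rng = np.random.default_rng(0)
segment_embeddings = rng.normal(size=(30, 64))

similarity = cosine_similarity(segment_embeddings)
similarity = (similarity + 1.0) / 2.0  # shift to [0, 1] so it is a valid affinity

sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
speaker_labels = sc.fit_predict(similarity)
print(speaker_labels)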

Efficient speaker diarization and low-latency speaker spotting

TLDR
The new task, coined low-latency speaker spotting (LLSS), involves the rapid detection of known speakers within multi-speaker audio streams and requires a re-thinking of online diarization and of how diarization and detection sub-systems should best be combined.

Compositional Embedding Models for Speaker Identification and Diarization with Simultaneous Speech From 2+ Speakers

  • Zeqian Li, J. Whitehill
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
A new method for speaker diarization is proposed that can handle overlapping speech with 2+ people; it outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets).

Speech Enhancement for Multimodal Speaker Diarization System

TLDR
The performance of the proposed multimodal speaker diarization system under noisy conditions is presented: the LSTM model's improvement is comparable with the Wiener filter, while under realistic environmental noise the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).

Enhancements for Audio-only Diarization Systems

TLDR
Two different approaches are presented to enhance the most challenging component of a speaker diarization system, i.e. speaker clustering: a temporal smoothing process combined with nonlinear filtering, and improvements to the Deep Embedded Clustering algorithm.

Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection

TLDR
A neural long short-term memory-based architecture for overlap detection is detailed; it achieves state-of-the-art performance on the AMI, DIHARD, and ETAPE corpora and reveals promising directions for handling overlap.
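
A minimal sketch, assuming PyTorch, of a framewise bidirectional LSTM overlapped-speech detector; the feature dimension and layer sizes are illustrative and not those of the cited system.

import torch
import torch.nn as nn

class OverlapDetector(nn.Module):
    """Maps a sequence of acoustic features to a per-frame overlap probability."""
    def __init__(self, n_features=59, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, features):                  # (batch, frames, n_features)
        hidden_states, _ = self.lstm(features)
        logits = self.classifier(hidden_states)   # (batch, frames, 1)
        return torch.sigmoid(logits).squeeze(-1)  # per-frame overlap probability

model = OverlapDetector()
frames = torch.randn(4, 200, 59)   # dummy batch: 4 sequences of 200 feature frames
overlap_prob = model(frames)       # shape (4, 200), values in [0, 1]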

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

TLDR
A novel multimodal speaker diarization technique is proposed that finds the active speaker through an audio-visual synchronization model, using a two-stream network that matches audio frames with their respective visual input segments.

References

Showing 1–10 of 26 references.

Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization

TLDR
This paper addresses the issue of low-latency speaker diarization, which consists of continuously detecting new or reoccurring speakers within an audio stream and determining when each speaker is active with a low latency (e.g. every second).
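
For illustration only, here is a generic greedy online-clustering sketch of the low-latency setting (not the paper's incremental structure prediction): each incoming speech-turn embedding is matched to the closest known speaker centroid, or opens a new speaker when similarity falls below a threshold.

import numpy as np

def assign_turn(embedding, centroids, counts, threshold=0.5):
    """Assign one turn embedding to a speaker index, updating centroids in place."""
    if centroids:
        sims = [float(np.dot(embedding, c) /
                      (np.linalg.norm(embedding) * np.linalg.norm(c)))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            counts[best] += 1
            centroids[best] += (embedding - centroids[best]) / counts[best]  # running mean
            return best
    # No sufficiently similar speaker: start a new one.
    centroids.append(embedding.copy())
    counts.append(1)
    return len(centroids) - 1

centroids, counts = [], []
rng = np.random.default_rng(0)
for _ in range(10):  # stream of incoming speech-turn embeddings
    speaker = assign_turn(rng.normal(size=32), centroids, counts)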

Speaker Diarization with LSTM

TLDR
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.

Speaker Diarization: A Review of Recent Research

TLDR
An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data is presented, and important areas for future research are identified.

Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks

TLDR
The results show that the proposed model brings a good improvement over conventional methods based on BIC and Gaussian divergence.

Improving Speaker Diarization

TLDR
The improved LIMSI speaker diarization system used in the RT-04F evaluation reduces the speaker error time by over 75% on the development data, compared to the best configuration baseline system for this task.

Automatic Segmentation, Classification and Clustering of Broadcast News Audio

TLDR
This work describes the problems faced in adapting a system built to recognize one utterance at a time to a task that requires recognition of an entire half-hour show, and shows that a priori knowledge of acoustic conditions and speakers in the broadcast data is not required for segmentation.

TristouNet: Triplet loss for speaker turn embedding

  • H. Bredin
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
Experiments on short speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.
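
A small sketch, assuming PyTorch, of how a triplet loss for speaker-turn embeddings can be set up; the embedding network and feature dimensions below are placeholders, not TristouNet itself.

import torch
import torch.nn as nn

# Placeholder embedding network (TristouNet itself is a recurrent architecture).
embed = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 16))
triplet_loss = nn.TripletMarginLoss(margin=0.2)

anchor_feats   = torch.randn(8, 40)  # speech turns from speaker A
positive_feats = torch.randn(8, 40)  # other turns from the same speaker A
negative_feats = torch.randn(8, 40)  # turns from a different speaker

loss = triplet_loss(embed(anchor_feats), embed(positive_feats), embed(negative_feats))
loss.backward()  # gradients pull same-speaker turns together, push others apart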

An overview of automatic speaker diarization systems

TLDR
An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.

A novel speaker clustering algorithm via supervised affinity propagation

TLDR
A modified method is proposed that automatically reruns the affinity propagation (AP) procedure so that the final number of clusters converges to a specified number; it leads to a noticeable speaker-purity improvement with a slight cluster-purity decrease compared with standard AP.
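
One hedged way to realise the "rerun AP until the cluster count converges" idea is to bisect the affinity-propagation preference toward a target number of clusters; this is a generic sketch, not necessarily the exact procedure of the paper.

import numpy as np
from sklearn.cluster import AffinityPropagation

def ap_with_target(similarity, target_k, iterations=15):
    """Rerun affinity propagation, bisecting the preference until roughly target_k clusters."""
    lo, hi = float(similarity.min()), float(similarity.max())
    labels = None
    for _ in range(iterations):
        preference = (lo + hi) / 2.0
        ap = AffinityPropagation(affinity="precomputed",
                                 preference=preference, random_state=0)
        labels = ap.fit_predict(similarity)
        k = len(set(labels))
        if k == target_k:
            break
        if k > target_k:   # too many clusters: lower the preference
            hi = preference
        else:              # too few clusters: raise the preference
            lo = preference
    return labels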

S4D: Speaker Diarization Toolkit in Python

TLDR
S4D, an extension of the open-source speaker recognition toolkit SIDEKIT, provides various state-of-the-art components and makes it easy to develop end-to-end diarization prototype systems.