Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan
Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Through this formulation, we propose… 
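The power-set reformulation described in the abstract can be sketched in a few lines: each binary activity vector over the target speakers is mapped to a single class index, so one softmax over all speaker combinations replaces per-speaker sigmoids and thresholds. This is an illustrative sketch, not the paper's implementation; the function names and shapes are assumptions.

```python
# Hedged sketch of power-set label encoding for overlapping speech.
# Each subset of active speakers maps to a unique class in [0, 2**N).

def encode_powerset(active, num_speakers):
    """Map a binary activity vector (one bit per target speaker)
    to a single power-set class index."""
    index = 0
    for spk in range(num_speakers):
        if active[spk]:
            index |= 1 << spk  # set the bit for this active speaker
    return index

def decode_powerset(index, num_speakers):
    """Invert the encoding back to per-speaker binary labels."""
    return [(index >> spk) & 1 for spk in range(num_speakers)]

# With 3 target speakers, overlap of speakers 0 and 2 becomes the
# single class 0b101 = 5, so a softmax over 8 classes replaces
# 3 independent sigmoids and their detection thresholds.
```

Because prediction becomes an argmax over power-set classes, overlap is modeled explicitly as its own class and no decision threshold is needed, matching the two benefits the abstract lists.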




Speaker diarization using deep neural network embeddings

This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.

End-to-End Neural Speaker Diarization with Permutation-Free Objectives

Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained or adapted with real-recorded multi-speaker conversations simply by feeding the corresponding multi-speaker segment labels.

End-to-End Neural Speaker Diarization with Self-Attention

The experimental results revealed that self-attention was the key to achieving good performance: the proposed EEND method performed significantly better than the conventional BLSTM-based method and even outperformed the state-of-the-art x-vector clustering-based method.

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence, and then multiplies the generated attractors by the embedding sequence to produce the same number of speaker activities.

Deep Neural Network Embeddings for Text-Independent Speaker Verification

It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.

Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

A novel Target-Speaker Voice Activity Detection (TS-VAD) approach is proposed, which directly predicts the activity of each speaker on each time frame, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).

X-Vectors: Robust DNN Embeddings for Speaker Recognition

This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

The AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, is made available and the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) is launched with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field.

Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for the M2met Challenge

  • Weiqing Wang, Xiaoyi Qin, Ming Li
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
An x-vector-based target-speaker voice activity detection (TS-VAD) system with cross-channel self-attention is employed to improve performance, where the non-linear spatial correlations between different channels are learned and fused.