PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification

Siqi Zheng, Hongbin Suo, Qian Chen

Speaker embedding has been a fundamental feature for speaker-related tasks such as verification, clustering, and diarization. Traditionally, speaker embeddings are represented as fixed vectors in high-dimensional space. This could lead to biased estimations, especially when handling shorter utterances. In this paper we propose to represent a speaker utterance as a “floating” vector whose state is indeterminate without knowing the context. The state of a speaker representation is jointly determined…

Speaker diarization using deep neural network embeddings
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence, and then the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities.
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance
Experimental results show that exploiting phonetic information encoded in BFs is very valuable for DNN speaker embeddings, and enrichment of the BFs using a cascaded DNN multi-task architecture is shown to provide further improvements to the speaker embedding system.
An Iterative Framework for Self-Supervised Deep Speaker Representation Learning
Danwei Cai, Weiqing Wang, Ming Li. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
This paper proposes an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN), in which the speaker network is iteratively trained with pseudo labels generated from the previous step to bootstrap the discriminative power of the DNN.
End-to-End Neural Speaker Diarization with Self-Attention
The experimental results revealed that self-attention was the key to achieving good performance, and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even outperformed the state-of-the-art x-vector clustering-based method.
A Real-Time Speaker Diarization System Based on Spatial Spectrum
This paper describes a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting, and suggests that the system effectively incorporates spatial information and achieves significant gains.
A novel scheme for speaker recognition using a phonetically-aware deep neural network
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.