PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification
@article{Zheng2022PRISMPI, title={PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification}, author={Siqi Zheng and Hongbin Suo and Qian Chen}, journal={ArXiv}, year={2022}, volume={abs/2205.07450} }
Speaker embedding has been a fundamental feature for speaker-related tasks such as verification, clustering, and diarization. Traditionally, speaker embeddings are represented as fixed vectors in high-dimensional space. This could lead to biased estimations, especially when handling shorter utterances. In this paper we propose to represent a speaker utterance as a “floating” vector whose state is indeterminate without knowing the context. The state of a speaker representation is jointly determined…
References
Speaker diarization using deep neural network embeddings
- Computer Science, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
- Computer Science, INTERSPEECH
- 2017
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
- Computer Science, INTERSPEECH
- 2020
This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence; the generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities.
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
- Computer Science, INTERSPEECH
- 2019
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance
- Computer Science, INTERSPEECH
- 2018
Experimental results show that exploiting phonetic information encoded in BFs is very valuable for DNN speaker embeddings, and enrichment of the BFs using a cascaded DNN multi-task architecture is shown to provide further improvements to the speaker embedding system.
An Iterative Framework for Self-Supervised Deep Speaker Representation Learning
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This work proposes an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN), iteratively training the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of the DNN.
End-to-End Neural Speaker Diarization with Self-Attention
- Computer Science, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The experimental results revealed that self-attention was the key to achieving good performance, and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even better than the state-of-the-art x-vector clustering-based method.
A Real-Time Speaker Diarization System Based on Spatial Spectrum
- Physics, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting is described and it is suggested that the system effectively incorporates spatial information and achieves significant gains.
A novel scheme for speaker recognition using a phonetically-aware deep neural network
- Computer Science, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for…
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
- Computer Science, INTERSPEECH
- 2020
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.