Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We show that current speaker embedding extractors consistently underestimate speaker similarity in within-speaker cross-lingual trials. Consequently, the typical training and scoring protocols do not put enough emphasis on the compensation of intra-speaker language variability. We propose two techniques to increase…
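
As a rough illustration of what a "score shift" in within-speaker cross-lingual trials could look like, the sketch below compares the mean cosine score of same-language target trials with that of cross-lingual target trials. This is not the authors' analysis: the function names, the trial tuple layout, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math
from statistics import mean


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def target_score_shift(trials):
    """Mean same-language target score minus mean cross-lingual target score.

    Each trial is a hypothetical tuple:
    (enroll_emb, test_emb, enroll_lang, test_lang, is_target).
    Non-target (different-speaker) trials are skipped.
    """
    same_lang, cross_lang = [], []
    for enroll, test, lang_e, lang_t, is_target in trials:
        if not is_target:
            continue
        score = cosine_score(enroll, test)
        if lang_e == lang_t:
            same_lang.append(score)
        else:
            cross_lang.append(score)
    return mean(same_lang) - mean(cross_lang)


# Hand-made toy trials, not real data.
trials = [
    ([1.0, 0.0], [1.0, 0.0], "en", "en", True),   # same speaker, same language
    ([1.0, 0.0], [1.0, 1.0], "en", "fr", True),   # same speaker, cross-lingual
    ([1.0, 0.0], [0.0, 1.0], "en", "en", False),  # different speakers, ignored
]
shift = target_score_shift(trials)
print(f"score shift: {shift:.4f}")
```

A positive value means the cross-lingual target trials score lower on average than the same-language ones, which is the direction of underestimation the abstract describes.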


The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description
This technical report describes the IDLAB top-scoring submissions for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) in the supervised and unsupervised speaker verification tracks with a large margin fine-tuning strategy.
Speaker Recognition for Multi-speaker Conversations Using X-vectors
It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification
This paper describes the IDLab submission for the text-independent task of the Short-duration Speaker Verification Challenge 2021 and proposes a frequency-wise variant of Squeeze-Excitation (SE) which better preserves frequency-specific information when rescaling the feature maps.
Analysis of Score Normalization in Multilingual Speaker Recognition
The analysis shows that adaptive score normalization (using the top-scoring files per trial) selects cohorts that in 68% of cases contain recordings of the same language, and in 92% of cases of the same gender, as the enrollment and test recordings.
The SpeakIn System for VoxCeleb Speaker Recognition Challange 2021
This report describes several components of the SpeakIn submission to the VoxCeleb Speaker Recognition Challenge 2021, including data augmentation, network structures, domain-based large margin fine-tuning, and back-end refinement. The submission is a fusion of 9 models.
VoxCeleb: A Large-Scale Speaker Identification Dataset
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Comparison of Speaker Recognition Approaches for Real Applications
This paper describes the experimental setup and the results obtained using several state-of-the-art speaker recognition classifiers, and shows that the classifiers based on i-vectors obtain the best recognition and calibration accuracy.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
VOXLINGUA107: A Dataset for Spoken Language Recognition
This paper generates semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages and uses the data to build language recognition models for several spoken language identification tasks.