Corpus ID: 225062148

The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description

@article{Thienpondt2020TheIV,
  title={The IDLAB VoxCeleb Speaker Recognition Challenge 2020 System Description},
  author={Jenthe Thienpondt and Brecht Desplanques and Kris Demuynck},
  journal={ArXiv},
  year={2020},
  volume={abs/2109.04070}
}
In this technical report we describe the IDLAB top-scoring submissions for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) in the supervised and unsupervised speaker verification tracks. For the supervised verification tracks we trained 6 state-of-the-art ECAPA-TDNN systems and 4 ResNet34-based systems with architectural variations. On all models we apply a large margin fine-tuning strategy, which enables the training procedure to use higher margin penalties by using longer training… 
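The large margin fine-tuning mentioned above builds on additive angular margin (AAM) softmax training, where the target class's cosine score is penalised by an angular margin. A minimal sketch of the idea follows; the function name and the specific margin values (0.2 for the base stage, 0.5 for fine-tuning) are illustrative assumptions, not the authors' exact settings:

```python
import math

def aam_softmax_logit(cos_theta, is_target, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logit for one class.

    For the target speaker class, the angle between the embedding and
    the class weight is penalised by an additive margin m, yielding
    s * cos(theta + m); this forces tighter speaker clusters.
    Non-target classes keep the plain scaled cosine similarity.
    """
    if is_target:
        # Clamp to the valid acos domain to guard against rounding error.
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return scale * math.cos(theta + margin)
    return scale * cos_theta

# A larger margin always lowers the target logit, making the task
# harder; longer fine-tuning utterances make such penalties feasible.
base = aam_softmax_logit(0.8, is_target=True, margin=0.2)
hard = aam_softmax_logit(0.8, is_target=True, margin=0.5)
assert hard < base
```

The intuition: raising the margin during a final fine-tuning stage (on longer crops) demands larger inter-speaker angular separation than the initial training stage could sustain.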

Papers citing this work

The JHU submission to VoxSRC-21: Track 3
TLDR
A recently proposed non-contrastive self-supervised method from computer vision (CV), distillation with no labels (DINO), is used to train the initial model, which outperformed last year’s contrastive learning approach based on momentum contrast (MoCo).
The Phonexia VoxCeleb Speaker Recognition Challenge 2021 System Description
TLDR
The Phonexia submission for the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) in the self-supervised speaker verification track is described, along with unsuccessful solutions involving i-vectors instead of DNN embeddings and PLDA instead of cosine scoring.
Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information
TLDR
This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) and proposes two techniques to increase cross-lingual speaker verification robustness.
The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021
TLDR
This report describes several components of the VoxCeleb Speaker Recognition Challenge 2021 submission, including data augmentation, network structures, domain-based large margin fine-tuning, and back-end refinement; the final system is a fusion of 9 models.
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification
TLDR
This paper describes the IDLab submission for the text-independent task of the Short-duration Speaker Verification Challenge 2021 and proposes a frequency-wise variant of Squeeze-Excitation (SE) which better preserves frequency-specific information when rescaling the feature maps.
VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge
The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition…
VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge
TLDR
The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or 'in the wild' data.
North America Bixby Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2021
This paper describes the submission to the speaker diarization track of VoxCeleb Speaker Recognition Challenge 2021 done by North America Bixby Lab of Samsung Research America. Our speaker…
Self-supervised Speaker Recognition with Loss-gated Learning
TLDR
It is observed that a speaker recognition network tends to model data with reliable labels faster than data with unreliable labels, which motivates a loss-gated learning (LGL) strategy that extracts reliable labels through the fitting ability of the neural network during training.
Studying Squeeze-and-Excitation Used in CNN for Speaker Verification
TLDR
Results showed that applying SE only on the first stages of the ResNet allows the network to better capture speaker information for the verification task, and that significant discrimination gains on VoxCeleb1-E, VoxCeleb1-H and SITW evaluation tasks have been noted using the proposed pooling variant.

References

Showing 1–10 of 37 references
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification
TLDR
This paper describes the IDLab submission for the text-independent task of the Short-duration Speaker Verification Challenge 2021 and proposes a frequency-wise variant of Squeeze-Excitation (SE) which better preserves frequency-specific information when rescaling the feature maps.
VoxCeleb: A Large-Scale Speaker Identification Dataset
TLDR
This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.
Augmentation adversarial training for unsupervised speaker recognition
TLDR
The goal of this work is to train robust speaker recognition models without speaker labels by proposing augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied.
VoxCeleb2: Deep Speaker Recognition
TLDR
A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.
Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition
  • Nakamasa Inoue, Keita Goto · 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
TLDR
A semi-supervised contrastive learning framework and its application to text-independent speaker verification are presented; the framework employs a generalized contrastive loss (GCL) that enables the learning of speaker embeddings in three manners: supervised, semi-supervised, and unsupervised.
Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification
TLDR
Two approaches for modeling long temporal contexts to improve the performance of ResNet networks are explored, and the BLSTM and ResNet are combined into one unified architecture.
Comparison of Speaker Recognition Approaches for Real Applications
TLDR
This paper describes the experimental setup and the results obtained using several state-of-the-art speaker recognition classifiers, and shows that the classifiers based on i-vectors obtain the best recognition and calibration accuracy.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
MagNetO: X-vector Magnitude Estimation Network plus Offset for Improved Speaker Recognition
TLDR
A magnitude estimation network that is combined with a modified ResNet x-vector system to generate embeddings whose inner product is able to produce calibrated scores with increased discrimination and calibration gains at multiple operating points is presented.
A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: The Deepmine Database
TLDR
The database can serve for training robust ASR models, and several evaluation protocols for each part of the database are provided to allow research on different aspects of speaker verification.