Label-free Knowledge Distillation with Contrastive Loss for Light-weight Speaker Recognition

  • Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan
  • Published 6 December 2022
  • Computer Science
  • 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvement in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, light-weight models are highly desired in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically… 
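
The abstract is truncated before it describes the actual method, so the following is only a minimal sketch of how a label-free, contrastive distillation objective could look; it is not the authors' formulation. It assumes an InfoNCE-style loss in which the student and teacher embeddings of the same utterance form the positive pair and the teacher embeddings of the other utterances in the batch serve as negatives, so no speaker labels are needed. The function names, the temperature value, and the assumption that student and teacher embeddings share one dimension are all illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def contrastive_kd_loss(student_embs, teacher_embs, temperature=0.1):
    """InfoNCE-style distillation loss averaged over a batch (hypothetical helper).

    student_embs[i] and teacher_embs[i] are assumed to come from the same
    utterance (positive pair); teacher_embs[j] with j != i act as negatives.
    Speaker labels are never used.
    """
    if not student_embs or len(student_embs) != len(teacher_embs):
        raise ValueError("need two non-empty batches of equal size")
    students = [l2_normalize(v) for v in student_embs]
    teachers = [l2_normalize(v) for v in teacher_embs]

    total = 0.0
    for i, s_i in enumerate(students):
        # Cosine similarity of this student embedding to every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature
                  for t_j in teachers]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of picking the matching teacher embedding.
        total += log_denom - logits[i]
    return total / len(students)


# Toy check: a student that matches its own teacher embedding scores a
# lower loss than one that matches the wrong utterance.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_kd_loss([[1.0, 0.0], [0.0, 1.0]], teacher)
swapped = contrastive_kd_loss([[0.0, 1.0], [1.0, 0.0]], teacher)
print(aligned < swapped)
```

In a real system the embeddings would come from a large teacher network and a light-weight student network, and a learned projection would likely be needed when their embedding dimensions differ; the sketch leaves both out.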

Knowledge Distillation for Small Foot-print Deep Speaker Embedding

Results show that the proposed knowledge distillation methods can significantly boost the performance of highly compact student models.

Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

The limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV) are explored, especially with a well-recognized SOTA ASV model, ECAPA-TDNN, as a downstream model.

Learning Speaker Embedding with Momentum Contrast

A comparative study confirms the effectiveness of MoCo in learning good speaker embeddings, and fine-tuning the MoCo-trained model reduces the equal error rate (EER) compared to a carefully tuned baseline trained from scratch.

X-Vectors: Robust DNN Embeddings for Speaker Recognition

This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.

Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning

A simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples, is examined.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, is proposed to solve full-stack downstream speech tasks; it achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements for various speech processing tasks on their representative benchmarks.

Deep Normalization for Speaker Vectors

It is argued that deep speaker vectors require deep normalization, and a deep normalization approach based on a novel discriminative normalization flow (DNF) model is proposed; experiments on the widely used SITW and CNCeleb corpora demonstrate its effectiveness.

In defence of metric learning for speaker recognition

It is demonstrated that the vanilla triplet loss shows competitive performance compared to classification-based losses, and models trained with the proposed metric learning objective outperform state-of-the-art methods.

Towards Lightweight Applications: Asymmetric Enroll-Verify Structure for Speaker Verification

This paper proposes an asymmetric structure that uses the large-scale ECAPA-TDNN model for enrollment and the small-scale ECAPA-TDNNLite model for verification, reducing the EER to 2.31%.