Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization

  Mufan Sang, Haoqi Li, F. Liu, Andrew O. Arnold, Li Wan
Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worth exploring. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the…
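The abstract describes a positive-pair-only objective in the spirit of SimSiam: only the similarity between two augmented views of the same utterance is optimized, with no negative pairs. A minimal numpy sketch of such a symmetrized negative-cosine loss is below; the function names and shapes are illustrative assumptions, not the authors' code, and the stop-gradient is only mimicked by treating the target as a constant array.

```python
import numpy as np

def negative_cosine(p, z):
    """Negative cosine similarity between predictor outputs p and
    targets z (z plays the role of a stop-gradient branch here)."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return -(p * z).sum(axis=-1).mean()

def ssreg_loss(p1, z1, p2, z2):
    """Symmetrized positive-pair loss: no negative pairs are used."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

# Two augmented "views" of the same batch of utterance embeddings.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = rng.normal(size=(4, 8))
loss = ssreg_loss(z1, z1, z2, z2)  # always lies in [-1, 1]
```

When the two views are perfectly aligned the loss reaches its minimum of -1, which is what the network is driven toward during training.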

Self-supervised curriculum learning for speaker verification
This work adapts the DINO framework for speaker recognition, in which the model is trained without exploiting negative utterance pairs, and proposes two curriculum learning strategies: one gradually increases the number of speakers in the training dataset, and the other gradually applies augmentations within a mini-batch as training proceeds.
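The first curriculum ("gradually increase the number of speakers") can be sketched as a simple schedule function; the linear shape and the parameter names here are hypothetical assumptions for illustration, not the schedule used in the paper.

```python
def speakers_at_epoch(epoch, total_epochs, n_start, n_final):
    """Hypothetical linear curriculum: grow the number of training
    speakers from n_start to n_final over the course of training."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return int(round(n_start + frac * (n_final - n_start)))

# e.g. grow from 500 to 5000 speakers across 10 epochs
schedule = [speakers_at_epoch(e, 10, 500, 5000) for e in range(10)]
```

At each epoch the training set would then be subsampled to the scheduled number of speakers before mini-batches are drawn.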
Raw waveform speaker verification for supervised and self-supervised learning
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification, including the Res2Net backbone module and an aggregation method that considers both context and channels, and shows state-of-the-art performance.
Pushing the limits of raw waveform speaker recognition
The proposed speaker recognition model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation and outperforms the best model based on raw waveform inputs by a large margin.
Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning
It is shown mathematically that GAP is a special case of a discrete cosine transform (DCT) on the time-frequency domain that uses only the lowest frequency component of the frequency decomposition, which is incapable of preserving sufficient speaker information in the feature maps.
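The claim that GAP equals the lowest-frequency DCT component is easy to verify numerically: for the DCT-II, the k=0 basis function is constant (cos 0 = 1), so the k=0 coefficient is just the sum of the signal, and dividing by the length recovers the average. A short sketch (the 1-D signal and helper name are illustrative):

```python
import numpy as np

def dct2_coeff(x, k):
    """Unnormalized DCT-II coefficient k of a 1-D signal x:
    X_k = sum_n x_n * cos(pi/N * (n + 0.5) * k)."""
    n = np.arange(len(x))
    return np.sum(x * np.cos(np.pi / len(x) * (n + 0.5) * k))

x = np.array([0.3, 1.2, -0.7, 2.0, 0.5])
gap = x.mean()                    # global average pooling
x0 = dct2_coeff(x, 0) / len(x)    # lowest-frequency DCT component / N
# cos(0) = 1 for every n, so X_0 = sum(x) and X_0 / N equals the mean
```

This is why attention modules built on GAP alone discard all higher-frequency structure of the feature maps.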
Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning
This work examines a simple contrastive learning approach (SimCLR) combined with a momentum contrastive (MoCo) learning framework, in which the MoCo speaker embedding system maintains a queue to hold a large set of negative examples.
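The queue mentioned above is MoCo's key idea: a FIFO dictionary of past key embeddings serving as negatives, decoupling the number of negatives from the batch size. A minimal sketch follows; this is an illustrative assumption about the mechanism, not the paper's implementation (real MoCo keeps L2-normalized keys as a GPU tensor).

```python
import numpy as np
from collections import deque

class NegativeQueue:
    """Minimal FIFO dictionary of negative embeddings, MoCo-style.
    Oldest keys are evicted automatically once maxlen is reached."""
    def __init__(self, maxlen):
        self.buf = deque(maxlen=maxlen)

    def enqueue(self, keys):
        for k in keys:
            self.buf.append(k / np.linalg.norm(k))  # store unit vectors

    def negatives(self):
        return np.stack(self.buf)

q = NegativeQueue(maxlen=4)
rng = np.random.default_rng(1)
q.enqueue(rng.normal(size=(6, 8)))  # 6 keys in, oldest 2 evicted
```

After the enqueue, `q.negatives()` holds the 4 most recent unit-norm keys, ready to be used as negatives in an InfoNCE-style loss.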
Contrastive Self-Supervised Learning for Text-Independent Speaker Verification
This work exploits a contrastive self-supervised learning (CSSL) approach for text-independent speaker verification task and proposes channel-invariant loss to prevent the network from encoding the undesired channel information into the speaker representation.
Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition
  • Nakamasa Inoue, Keita Goto
  • Computer Science
    2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2020
A semi-supervised contrastive learning framework is proposed and applied to text-independent speaker verification; the framework employs a generalized contrastive loss (GCL), which enables learning speaker embeddings in three manners: supervised learning, semi-supervised learning, and unsupervised learning.
Augmentation adversarial training for unsupervised speaker recognition
The goal of this work is to train robust speaker recognition models without speaker labels by proposing augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied.
Learning Speaker Embedding with Momentum Contrast
A comparative study confirms the effectiveness of MoCo in learning good speaker embeddings, and fine-tuning the MoCo-trained model reduces the equal error rate (EER) compared to a carefully tuned baseline trained from scratch.
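The other half of the MoCo recipe referenced here is the momentum-updated key encoder: its weights track the query encoder as an exponential moving average, which keeps the key representations (and hence the queued negatives) slowly evolving and consistent. A per-parameter sketch, with illustrative scalar "weights":

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """MoCo-style EMA of query weights into key weights:
    theta_k <- m * theta_k + (1 - m) * theta_q, per parameter."""
    return [m * k + (1 - m) * q for k, q in zip(theta_k, theta_q)]

key_weights = [1.0, 0.0]
query_weights = [0.0, 1.0]
key_weights = momentum_update(key_weights, query_weights, m=0.9)
# key weights drift slowly toward the query weights: ~[0.9, 0.1]
```

With m close to 1 (0.999 is a common choice), the key encoder changes only slightly per step, so keys already sitting in the negative queue stay comparable to freshly computed ones.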
Generative Adversarial Speaker Embedding Networks for Domain Robust End-to-end Speaker Verification
A novel approach for learning domain-invariant speaker embeddings using Generative Adversarial Networks, able to match the performance of a strong baseline x-vector system and significantly boost verification performance by averaging the different GAN models at the score level.
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
A novel training strategy is proposed that not only optimises metrics across modalities but also enforces intra-class feature separation within each modality, outperforming state-of-the-art self-supervised baselines.
Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
A self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity without access to manually annotated data is developed.
DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning
  • Mufan Sang, Wei Xia, J. Hansen
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A novel framework to disentangle speaker-related and domain-specific features and apply domain adaptation on the speaker-related feature space solely, which can effectively generate more speaker-discriminative and domain-invariant speaker representations with a relative 20.3% reduction of EER over the original ResNet-based system.
Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias
It is shown that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances and that the fine-tuning strategy outperforms the commonly used weight decay method by providing an explicit inductive bias towards the pre-trained model.