How to train your speaker embeddings extractor

@inproceedings{McLaren2018HowTT,
  title={How to train your speaker embeddings extractor},
  author={Mitchell McLaren and Diego Cast{\'a}n and Mahesh Kumar Nandwana and Luciana Ferrer and Emre Yilmaz},
  booktitle={Odyssey},
  year={2018}
}
With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions require addressing in order to fast-track the development of this new era of technology. We lay out a set of recommendations for training the network based on the observed trends. By applying these recommendations to enhance the default recipe provided in the Kaldi toolkit, a significant gain of 13-21% is achieved on the Speakers in the Wild and NIST SRE’16 datasets.
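For readers unfamiliar with the kind of extractor the paper's recommendations apply to, the x-vector-style architecture it builds on maps variable-length acoustic features to a fixed-size embedding via frame-level layers followed by statistics pooling. The NumPy sketch below is purely illustrative: the layer sizes, weight initialization, and function names are invented for the example and do not reflect the paper's Kaldi recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes only -- not the paper's configuration.
FEAT_DIM, HIDDEN_DIM, EMBED_DIM = 20, 64, 32

# Randomly initialized weights stand in for a trained network.
W_frame = rng.standard_normal((FEAT_DIM, HIDDEN_DIM)) * 0.1
W_embed = rng.standard_normal((2 * HIDDEN_DIM, EMBED_DIM)) * 0.1

def extract_embedding(frames: np.ndarray) -> np.ndarray:
    """Map (num_frames, FEAT_DIM) acoustic features to a fixed-size embedding."""
    h = np.maximum(frames @ W_frame, 0.0)      # frame-level layer with ReLU
    stats = np.concatenate([h.mean(axis=0),    # statistics pooling: mean and
                            h.std(axis=0)])    # std deviation over all frames
    return stats @ W_embed                     # segment-level embedding

utterance = rng.standard_normal((300, FEAT_DIM))  # e.g. ~3 s of MFCC frames
emb = extract_embedding(utterance)
print(emb.shape)  # (32,)
```

The key property is that utterances of any duration yield the same embedding dimensionality, which is what allows a fixed backend (e.g. PLDA) to score them.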

Citations

Optimizing a Speaker Embedding Extractor Through Backend-Driven Regularization
TLDR
This work proposes a way to encourage the DNN to generate embeddings that are optimized for use in the PLDA backend, by adding a secondary objective designed to measure the performance of such a backend within the network.
How to Improve Your Speaker Embeddings Extractor in Generic Toolkits
TLDR
This paper focuses on implementing speaker embeddings extracted with deep neural networks in a more generic toolkit than Kaldi, and examines several training tricks, such as normalizing input features and pooled statistics, different methods for preventing overfitting, and alternative non-linearities that can be used instead of Rectified Linear Units.
Analysis of Complementary Information Sources in the Speaker Embeddings Framework
TLDR
It is found that first and second embeddings layers are complementary in nature, and relative improvements in equal error rate of 17% on NIST SRE 2016 and 10% on SITW over the baseline system are demonstrated.
Adaptive Mean Normalization for Unsupervised Adaptation of Speaker Embeddings
TLDR
It is shown that the proposed adaptive mean normalization (AMN) technique is extremely effective for improving discrimination and calibration performance, by up to 26% and 65% relative over out-of-the-box system performance.
x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition
TLDR
This work presents a DNN refinement approach that updates a subset of the DNN parameters using full-length recordings, reducing the mismatch between training and inference when extracting embeddings for long-duration recordings.
Speaker Recognition for Multi-speaker Conversations Using X-vectors
TLDR
It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
Bandwidth Extension for Deep Speaker Embedding
TLDR
This paper investigates a novel data augmentation approach that increases the number of speakers in the set used to train deep neural networks for speaker embedding, i.e. for extracting representations that allow easy comparison between speaker voices via a simple geometric operation.
Review of different robust x-vector extractors for speaker verification
TLDR
This paper reviews and analyses the impact of the most significant x-vector-related approaches, including variations in data augmentation, number of epochs, mini-batch size, acoustic features, and frames per iteration, and observes a significant relative gain on the Speakers in the Wild and VoxCeleb1-E datasets.
Introducing phonetic information to speaker embedding for speaker verification
TLDR
Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline, and the c-vector system performs the best.

References

SHOWING 1-10 OF 19 REFERENCES
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TLDR
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
TLDR
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Deep neural network-based speaker embeddings for end-to-end speaker verification
TLDR
It is shown that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates.
Improving Robustness of Speaker Recognition to New Conditions Using Unlabeled Data
TLDR
This analysis shows that while the benefit of S-norm is also observed across other datasets, applying speaker-clustered calibration provides considerably greater benefit to the system in the context of new acoustic conditions.
Deep neural networks for small footprint text-dependent speaker verification
TLDR
Experimental results show the DNN based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task and is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points.
A novel scheme for speaker recognition using a phonetically-aware deep neural network
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR).
Advances in deep neural network approaches to speaker recognition
TLDR
This work considers two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses a DNN during feature modeling, and several methods of DNN feature processing are applied to bring significantly greater robustness to microphone speech.
Improving robustness to compressed speech in speaker recognition
TLDR
It was found that robustness to compressed speech was marginally improved by exposing PLDA to noisy and reverberant speech, with little improvement from using transcoded speech in PLDA based on codecs mismatched to the evaluation conditions.
The Speakers in the Wild (SITW) Speaker Recognition Database
The Speakers in the Wild (SITW) speaker recognition database contains hand-annotated speech samples from open-source media for the purpose of benchmarking text-independent speaker recognition technology.