X-Vectors: Robust DNN Embeddings for Speaker Recognition

David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector…
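
The two augmentation types named in the abstract, added noise and reverberation, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `add_noise` and `add_reverb`, the target SNR, and the synthetic signals are all assumptions made for illustration, and a real setup would use recorded noise and measured or simulated room impulse responses.

```python
import math
import random


def add_noise(speech, noise, snr_db):
    """Mix noise into speech so the speech-to-noise power ratio equals snr_db.

    Hypothetical helper for illustration; both inputs are lists of floats.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, truncated to the input length.

    Hypothetical helper for illustration; direct O(N*M) convolution.
    """
    out = [0.0] * len(speech)
    for i, s in enumerate(speech):
        for j, h in enumerate(rir):
            if i + j < len(speech):
                out[i + j] += s * h
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins: a sine wave for speech, Gaussian noise,
    # and an exponentially decaying impulse response.
    clean = [math.sin(0.05 * t) for t in range(400)]
    babble = [rng.gauss(0.0, 1.0) for _ in range(400)]
    rir = [math.exp(-0.3 * k) for k in range(20)]

    # Each augmented copy keeps the speaker label of the clean utterance,
    # so the amount of labeled training data is multiplied.
    augmented = [clean, add_noise(clean, babble, snr_db=10.0), add_reverb(clean, rir)]
    print(len(augmented), "training copies from one utterance")
```

Because each augmented copy shares the label of its clean source, the training set grows without any additional annotation, which is the sense in which the abstract calls the method inexpensive.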

Embeddings for DNN Speaker Adaptive Training
The performance for speaker recognition of a given representation is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than just speaker identity was the most important characteristic of the embeddings for efficient DNN-SAT ASR.
Neural i-vectors
The deep embeddings compared to the proposed neural i-vectors on the Speakers in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019 datasets obtain performance comparable to the state of the art.
Self-supervised speaker embeddings
This paper proposes to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the inferred embedding of another speech segment of the same utterance, by attaching a decoder network to the standard speaker embedding extractor.
DNN Speaker Embeddings Using Autoencoder Pre-Training
By initializing the DNN with the parameters of the pre-trained autoencoder, this paper achieves a relative improvement of 21% in Equal Error Rate (EER) over the baseline i-vector/PLDA system.
Designing Neural Speaker Embeddings with Meta Learning
An open source toolkit to train x-vectors that is matched in performance with pre-trained Kaldi models for speaker diarization and speaker verification applications is developed and two meta-learning strategies are used to improve over the x-vector embeddings.
Optimizing a Speaker Embedding Extractor Through Backend-Driven Regularization
This work proposes one way to encourage the DNN to generate embeddings that are optimized for use in the PLDA backend, by adding a secondary objective designed to measure the performance of such a backend within the network.
Speaker-Aware Training of Attention-Based End-to-End Speech Recognition Using Neural Speaker Embeddings
  • Aku Rouhe, Tuomas Kaseva, M. Kurimo
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
In speaker-aware training, a speaker embedding is appended to the DNN input features, which allows the DNN to effectively learn representations that are robust to speaker variability, and it is shown that this can improve over a purely end-to-end baseline.
Few Shot Speaker Recognition using Deep Neural Networks
This paper proposes to identify speakers by learning from only a few training examples, using a deep neural network with prototypical loss where the input to the network is a spectrogram, and utilizes an auto-encoder to learn generalized feature embeddings from class-specific embeddings obtained from a capsule network.
Probing the Information Encoded in X-Vectors
Simple classifiers are used to investigate the contents encoded by x-vector embeddings for information related to the speaker, channel, transcription, and meta information about the utterance and compare these with the information encoded by i-vectors across a varying number of dimensions.


Deep Neural Network Embeddings for Text-Independent Speaker Verification
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Time delay deep neural network-based universal background models for speaker recognition
This study investigates a lightweight alternative in which a supervised GMM is derived from the TDNN posteriors, which maintains the speed of the traditional unsupervised-GMM, but achieves a 20% relative improvement in EER.
Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition
Although the proposed i-vectors yield inferior performance compared to the standard ones, they are capable of attaining 16% relative improvement when fused with them, meaning that they carry useful complementary information about the speaker’s identity.
Deep neural network-based speaker embeddings for end-to-end speaker verification
It is shown that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates.
Improving speaker recognition performance in the domain adaptation challenge using deep neural networks
This paper explores the use of DNNs to collect sufficient statistics (SS) for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC), and shows that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%.
Learning Speaker-Specific Characteristics With a Deep Neural Architecture
  • A. Salman
  • Computer Science
    IEEE Transactions on Neural Networks
  • 2011
A novel deep neural architecture is proposed specifically for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation.
Advances in deep neural network approaches to speaker recognition
This work considers two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses a DNN during feature modeling, and several methods of DNN feature processing are applied to bring significantly greater robustness to microphone speech.
End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances
An end-to-end system which directly learns a mapping from speech features to a compact fixed length speaker discriminative embedding where the Euclidean distance is employed for measuring similarity within trials.
A novel scheme for speaker recognition using a phonetically-aware deep neural network
We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for
Front-End Factor Analysis For Speaker Verification
  • Florin Curelaru
  • Computer Science
    2018 International Conference on Communications (COMM)
  • 2018
This paper investigates which configuration and which parameters lead to the best performance of an i-vectors/PLDA based speaker verification system and presents at the end some preliminary experiments in which the utterances comprised in the CSTR VCTK corpus were used besides utterances from MIT-MDSVC for training the total variability covariance matrix and the underlying PLDA matrices.