• Publications
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TLDR
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
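The augmentation described above boils down to mixing noise (or reverberation) into clean speech at a controlled signal-to-noise ratio, so each clean recording yields several new training examples with the same speaker label. A minimal sketch of additive-noise augmentation is below; the function name, SNR parameter, and synthetic signals are illustrative, not taken from the paper's actual recipe.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (in dB)."""
    # Tile or trim the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so the mixture hits the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Stand-in signals; in practice these would be waveforms loaded from disk.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of audio at 16 kHz
babble = rng.standard_normal(8000)   # shorter noise clip, tiled to fit
noisy = add_noise(clean, babble, snr_db=10.0)
```

Each augmented copy keeps the original speaker label, which is what multiplies the effective amount of training data.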
MUSAN: A Music, Speech, and Noise Corpus
TLDR
This report introduces a new corpus of music, speech, and noise suitable for training models for voice activity detection (VAD) and music/speech discrimination, and demonstrates the use of this corpus for music/speech discrimination on broadcast news and for VAD in speaker identification.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
TLDR
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Deep neural network-based speaker embeddings for end-to-end speaker verification
TLDR
It is shown that, given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error rate (EER) and at low miss rates.
Spoken Language Recognition using X-vectors
TLDR
This paper applies x-vectors to the task of spoken language recognition, and experiments with several variations of the x-vector framework, finding that the best performing system uses multilingual bottleneck features, data augmentation, and a discriminative Gaussian classifier.
Speaker Recognition for Multi-speaker Conversations Using X-vectors
TLDR
It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
Speaker diarization using deep neural network embeddings
TLDR
This work proposes an alternative approach that learns representations via deep neural networks, removing the i-vector extraction process from the pipeline entirely. Although this approach does not respond as well to unsupervised calibration strategies as previous systems, incorporating well-founded speaker priors sufficiently mitigates this shortcoming.
Time delay deep neural network-based universal background models for speaker recognition
TLDR
This study investigates a lightweight alternative in which a supervised GMM is derived from the TDNN posteriors, which maintains the speed of the traditional unsupervised GMM while achieving a 20% relative improvement in EER.
Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification
TLDR
The proposed self-attentive speaker embedding system is compared with a strong DNN embedding baseline on NIST SRE 2016, and it is found that the self-attentive embeddings achieve superior performance.
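The core idea behind self-attentive embeddings is to replace uniform mean pooling over frame-level features with a learned weighted average, so informative frames contribute more to the utterance-level vector. The sketch below shows a minimal single-head version; the parameter shapes and names are illustrative assumptions, and the paper's variant additionally pools weighted statistics and can use multiple attention heads.

```python
import numpy as np

def self_attentive_pooling(frames: np.ndarray, w: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Pool frame-level features (T x D) into one utterance-level vector.

    w (D x H) and v (H,) are learned attention parameters; here they are
    random stand-ins rather than trained weights.
    """
    hidden = np.tanh(frames @ w)          # (T, H): small hidden layer per frame
    scores = hidden @ v                   # (T,): one relevance score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over time
    # Weighted average over frames replaces uniform mean pooling.
    return weights @ frames               # (D,)

rng = np.random.default_rng(1)
T, D, H = 200, 64, 32                     # frames, feature dim, attention dim
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, H)) * 0.1
v = rng.standard_normal(H) * 0.1
embedding = self_attentive_pooling(frames, w, v)
```

With uniform weights this reduces exactly to mean pooling, which is why the attention layer can only help once the scoring network learns something useful.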
Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge
TLDR
Several key aspects of current state-of-the-art diarization methods are explored, such as training data selection, signal bandwidth for feature extraction, representations of speech segments (i-vector versus x-vector), and domain-adaptive processing.
...