Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification

@inproceedings{Bhattacharya2018DeeplyFS,
  title={Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification},
  author={Gautam Bhattacharya and Md. Jahangir Alam and Vishwa Gupta and Patrick Kenny},
  booktitle={INTERSPEECH},
  year={2018}
}
Recently there has been a surge of interest in learning speaker embeddings using deep neural networks. These models ingest time-frequency representations of speech and can be trained to discriminate between a known set of speakers. While embeddings learned in this way perform well, they typically require a large number of training data points. In this work we propose deeply fused speaker embeddings: speaker representations that combine neural speaker embeddings with i-vectors. We show…
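The abstract describes combining a neural speaker embedding with an i-vector into a single fused representation. A minimal sketch of one simple fusion strategy, length-normalize and concatenate, scored with cosine similarity, is shown below; the paper's "deep fusion" learns the combination jointly inside the network, which this illustration does not reproduce, and the 512/400 dimensions are assumptions for the example only.

```python
import numpy as np

def fuse_embeddings(dnn_embedding, i_vector):
    """Length-normalize each representation, then concatenate.

    This is a naive baseline fusion; the paper's deep fusion learns
    the combination jointly rather than concatenating fixed vectors.
    """
    d = dnn_embedding / np.linalg.norm(dnn_embedding)
    i = i_vector / np.linalg.norm(i_vector)
    return np.concatenate([d, i])

def cosine_score(a, b):
    """Cosine similarity between enrollment and test representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: a 512-dim neural embedding and a 400-dim i-vector
# (dimensions chosen for illustration, not taken from the paper).
rng = np.random.default_rng(0)
enroll = fuse_embeddings(rng.standard_normal(512), rng.standard_normal(400))
test = fuse_embeddings(rng.standard_normal(512), rng.standard_normal(400))
score = cosine_score(enroll, test)  # a value in [-1, 1]
```

In a real verification system the score would be compared against a calibrated threshold, or the fused vectors passed to a PLDA backend as in the i-vector pipeline.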


Combination of Deep Speaker Embeddings for Diarisation
Generative Adversarial Speaker Embedding Networks for Domain Robust End-to-end Speaker Verification
TLDR
A novel approach for learning domain-invariant speaker embeddings using Generative Adversarial Networks, able to match the performance of a strong baseline x-vector system and significantly boost verification performance by averaging the different GAN models at the score level.
Adapting End-to-end Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training
TLDR
This article applies speaker embeddings to the task of text-independent speaker verification, a challenging, real-world problem in biometric security, by combining a novel 1-dimensional, self-attentive residual network, an angular margin loss function and an adversarial training strategy.
SpeakerGAN: Recognizing Speakers in New Languages with Generative Adversarial Networks
TLDR
This work presents a flexible and interpretable framework for learning domain invariant speaker embeddings using Generative Adversarial Networks and shows that proposed adversarial speaker embedding models significantly reduce the distance between source and target data distributions, while performing similarly on the former and better on the latter.
Speaker Diarisation Using 2D Self-attentive Combination of Embeddings
  • Guangzhi Sun, Chao Zhang, P. Woodland
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
A generic framework to improve performance by combining them into a single embedding, referred to as a c-vector, is proposed, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings.
An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales
  • Bin Gu, Wu Guo
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
An improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification and a Baum-Welch statistics attention (BWSA) mechanism is applied in the pooling layer, which can integrate more useful long-term speaker characteristics in the temporal pooling layers.
Ensemble Additive Margin Softmax for Speaker Verification
  • Ya-Qi Yu, Lei Fan, Wu-Jun Li
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
Experiments on the large-scale VoxCeleb dataset show that AM-Softmax loss is better than traditional loss functions, and approaches using EAM-Softmax loss can outperform existing speaker verification methods to achieve state-of-the-art performance.

References

SHOWING 1-10 OF 27 REFERENCES
Deep Neural Network Embeddings for Text-Independent Speaker Verification
TLDR
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Deep Speaker: an End-to-End Neural Speaker Embedding System
TLDR
Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TLDR
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Deep neural network-based speaker embeddings for end-to-end speaker verification
TLDR
It is shown that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates.
Deep Neural Network based Text-Dependent Speaker Recognition: Preliminary Results
TLDR
While the DNN models outperform the RNN, both models perform poorly compared to a GMM-UBM system, which serves as motivation for the further development of neural network based speaker verification approaches using global features.
Deep Speaker Feature Learning for Text-Independent Speaker Verification
TLDR
This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning that can produce high-quality speaker features and confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.
Front-End Factor Analysis For Speaker Verification
  • Florin Curelaru
  • Computer Science
    2018 International Conference on Communications (COMM)
  • 2018
TLDR
This paper investigates which configuration and which parameters lead to the best performance of an i-vectors/PLDA based speaker verification system and presents at the end some preliminary experiments in which the utterances comprised in the CSTR VCTK corpus were used besides utterances from MIT-MDSVC for training the total variability covariance matrix and the underlying PLDA matrices.
End-to-end text-dependent speaker verification
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly…
Improving DNN speaker independence with I-vector inputs
  • A. Senior, I. Lopez-Moreno
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs), and the algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction.
FaceNet: A unified embedding for face recognition and clustering
TLDR
A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.