Learning Discriminative Features for Speaker Identification and Verification

Sarthak Yadav, Atul Kumar Rai
The success of any text-independent speaker identification and/or verification system relies upon the system’s capability to learn discriminative features. In this paper, we propose a Convolutional Neural Network (CNN) architecture based on the popular Very Deep VGG [1] CNNs, with key modifications to accommodate variable-length spectrogram inputs, reduce the model's disk space requirements, and reduce the number of parameters, resulting in a significant reduction in training times. We also propose a…


Deep multi-metric learning for text-independent speaker verification

Frequency and Temporal Convolutional Attention for Text-Independent Speaker Recognition

  • Sarthak Yadav, A. Rai
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb benchmark, supporting the conclusion that simultaneously modelling temporal and frequency attention translates to better real-world performance.

Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification

Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.

Training Speaker Recognition Models with Recording-Level Labels

  • Tanel Alumäe
  • Computer Science
    2018 IEEE Spoken Language Technology Workshop (SLT)
  • 2018
It is shown that, without using any reference segment-level labeling, the weakly supervised training method can achieve a 1% speaker recognition error rate on the official VoxCeleb closed-set speaker recognition test set, as opposed to the 5.4% that was previously reported.

Deep learning methods in speaker recognition: a review

This paper reviews Deep Learning practices applied in the field of Speaker Recognition, covering both verification and identification, and finds that Deep Learning appears to have become the state-of-the-art solution for both Speaker Verification (SV) and identification.

Angular Margin Centroid Loss for Text-Independent Speaker Recognition

This paper optimizes the cosine distances between speaker embeddings and their corresponding centroids, rather than the weight vectors in the classification layer, to enhance the intra-class compactness of speaker embeddings and explicitly improve the inter-class separability.

Learning Discriminative features using Center Loss and Reconstruction as Regularizer for Speech Emotion Recognition

A Convolutional Neural Network inspired by Multitask Learning (MTL) and based on speech features trained under the joint supervision of softmax loss and center loss for the recognition of emotion in speech is proposed.

Introducing phonetic information to speaker embedding for speaker verification

Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline, and the c-vector system performs the best.



VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.

Deep Neural Network Embeddings for Text-Independent Speaker Verification

It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.

Speaker identification and clustering using convolutional neural networks

This paper uses simple spectrograms as input to a CNN, studies the optimal design of those networks for speaker identification and clustering, and demonstrates the approach on the well-known TIMIT dataset, achieving results comparable with the state of the art without the need for handcrafted features.

Deep Speaker: an End-to-End Neural Speaker Embedding System

Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.

Front-End Factor Analysis For Speaker Verification

  • Florin Curelaru
  • Computer Science
    2018 International Conference on Communications (COMM)
  • 2018
This paper investigates which configuration and which parameters lead to the best performance of an i-vectors/PLDA based speaker verification system and presents at the end some preliminary experiments in which the utterances comprised in the CSTR VCTK corpus were used besides utterances from MIT-MDSVC for training the total variability covariance matrix and the underlying PLDA matrices.

A Discriminative Feature Learning Approach for Deep Face Recognition

This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

This paper proposes an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections, and argues that CNNs have the capability to model temporal correlations with appropriate context information.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.

Finding Difficult Speakers in Automatic Speaker Recognition

The phenomenon that some speakers within a given population have a tendency to cause a large proportion of errors, and ways of finding such speakers are investigated, as well as a straightforward approach to predict speakers that will be difficult for a system to correctly recognize.

DNN Bottleneck Features for Speaker Clustering

This work analyzes the bottleneck features obtained for speaker recognition and tests them in a speaker clustering scenario, observing that there are deep neural network topologies that work better for both cases, even when their classification criteria are loosely met.