An End-to-End Text-Independent Speaker Identification System on Short Utterances

@inproceedings{Ji2018AnET,
  title={An End-to-End Text-Independent Speaker Identification System on Short Utterances},
  author={Ruifang Ji and Xinyuan Cai and Bo Xu},
  booktitle={INTERSPEECH},
  year={2018}
}
In the field of speaker recognition, text-independent speaker identification on short utterances is still a challenging task, since it is rather tough to extract a robust and discriminative speaker feature under short-duration conditions. This paper explores an end-to-end speaker identification system, which maps utterances to a speaker identity subspace where the similarity of speakers can be measured by Euclidean distance. To be specific, we apply stacked gated recurrent unit (GRU) architectures to…
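
The identification-by-distance idea in the abstract reduces to a nearest-centroid search once an encoder (the paper's stacked GRUs, abstracted away here) has produced fixed-length embeddings. A minimal sketch, with purely illustrative names and synthetic vectors standing in for GRU outputs:

```python
import numpy as np

def identify_speaker(test_emb, enroll_embs):
    """Return the enrolled speaker whose centroid is nearest
    (Euclidean distance) to the test-utterance embedding.

    test_emb    : (d,) embedding of the test utterance
    enroll_embs : dict speaker_id -> (n_i, d) enrollment embeddings
    """
    best_id, best_dist = None, np.inf
    for spk, embs in enroll_embs.items():
        centroid = embs.mean(axis=0)            # per-speaker centroid
        dist = np.linalg.norm(test_emb - centroid)
        if dist < best_dist:
            best_id, best_dist = spk, dist
    return best_id

# Toy example: two speakers with well-separated synthetic embeddings.
rng = np.random.default_rng(0)
enroll = {
    "spk_a": rng.normal(0.0, 0.1, size=(5, 8)),
    "spk_b": rng.normal(1.0, 0.1, size=(5, 8)),
}
test = rng.normal(1.0, 0.1, size=8)  # drawn near spk_b's cluster
print(identify_speaker(test, enroll))  # -> spk_b
```

The actual system trains the encoder so that same-speaker embeddings cluster tightly; this sketch only shows the decision rule applied at identification time.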

Length- and Noise-Aware Training Techniques for Short-Utterance Speaker Recognition

TLDR
This work builds on earlier work that adds invariant representation learning (IRL) to the loss function, and modifies the approach with centroid alignment (CA) and length variability cost (LVC) techniques to further improve robustness in noisy, far-field applications.

Speaker recognition based on deep learning: An overview

Text-Independent Speaker Verification with Adversarial Learning on Short Utterances

  • Kai Liu, Huan Zhou
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This paper proposes an adversarially learned embedding mapping model that directly maps a short embedding to an enhanced embedding with increased discriminability and investigates the effectiveness of those loss criteria by conducting numerous ablation studies.

SpeechNAS: Towards Better Trade-Off Between Latency and Accuracy for Large-Scale Speaker Verification

TLDR
The derived best neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN based state-of-the-art approaches by a large margin.

A Hybrid GRU-CNN Feature Extraction Technique for Speaker Identification

TLDR
An end-to-end speaker identification pipeline introducing a hybrid Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN) feature extraction technique is demonstrated.

Spoken Language Identification with Deep Temporal Neural Network and Multi-levels Discriminative Cues

  • Linjia Sun
  • Linguistics, Computer Science
    2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)
  • 2020
TLDR
The language cue is an important component in the task of spoken language identification (LID), but aligning language cues to speech segments through manual annotation by professional linguists is very time-consuming, so a novel LID system is proposed based on a TDNN architecture followed by an LSTM-RNN.

Deep learning methods in speaker recognition: a review

TLDR
This paper reviews applied deep learning practices in the field of speaker recognition, both verification and identification, and finds that deep learning has become the state-of-the-art solution for both speaker verification (SV) and speaker identification.

Large Margin Softmax Loss for Speaker Verification

TLDR
Ring loss and minimum hyperspherical energy criterion are introduced to further improve the performance of the large margin softmax loss with different configurations in speaker verification.

Language Identification with Unsupervised Phoneme-like Sequence and TDNN-LSTM-RNN

  • Linjia Sun
  • Computer Science
    2020 15th IEEE International Conference on Signal Processing (ICSP)
  • 2020
TLDR
The experimental results show that the proposed LID method provides competitive performance with existing methods, capturing robust discriminative information for short-duration language identification and achieving high accuracy for dialect identification.

Dynamic Margin Softmax Loss for Speaker Verification

TLDR
A dynamic-margin softmax loss for training deep speaker embedding networks that sets the margin of each training sample commensurate with the cosine angle of that sample; hence the name dynamic additive margin softmax (DAM-Softmax) loss.
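
The per-sample margin idea can be sketched as a variant of additive-margin softmax in which the margin scales with the target cosine. The scaling rule below is illustrative only; the paper's exact formulation may differ:

```python
import numpy as np

def dam_softmax_loss(cos_theta, labels, base_margin=0.2, scale=30.0):
    """Sketch of a dynamic additive-margin softmax loss: each sample's
    margin grows with its target cosine, so confident (easy) samples are
    pushed harder than hard ones.

    cos_theta : (N, C) cosine similarities between embeddings and class weights
    labels    : (N,) ground-truth class indices
    """
    n = cos_theta.shape[0]
    target_cos = cos_theta[np.arange(n), labels]
    # Illustrative dynamic rule: margin proportional to the target cosine.
    margin = base_margin * np.clip(target_cos, 0.0, 1.0)
    logits = scale * cos_theta.copy()
    logits[np.arange(n), labels] = scale * (target_cos - margin)
    # Standard cross-entropy over the margin-adjusted logits.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), labels].mean()
```

Setting `base_margin=0` recovers the plain scaled-softmax loss, which is a useful sanity check: the margin term should always increase the loss.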

References

Showing 1-10 of 27 references

End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances

TLDR
An end-to-end system which directly learns a mapping from speech features to a compact fixed length speaker discriminative embedding where the Euclidean distance is employed for measuring similarity within trials.
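The triplet loss this reference trains with has a compact closed form: pull an anchor toward a same-speaker embedding and push it away from a different-speaker embedding by at least a margin. A minimal sketch (the margin value here is arbitrary, not the paper's):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss on speaker embeddings with Euclidean distance:
    same-speaker pairs should be closer than different-speaker pairs
    by at least `margin`; satisfied triplets contribute zero loss."""
    d_pos = np.linalg.norm(anchor - positive)   # same speaker
    d_neg = np.linalg.norm(anchor - negative)   # different speaker
    return max(d_pos - d_neg + margin, 0.0)

# A well-separated triplet already satisfies the margin, so loss is zero:
# d_pos = 0.1, d_neg = 1.0, max(0.1 - 1.0 + 0.5, 0) = 0.0
print(triplet_loss(np.zeros(2), np.array([0.1, 0.0]), np.array([1.0, 0.0])))
```

In training, only triplets that violate the margin produce gradient, which is why triplet mining strategies matter in practice.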

End-to-End attention based text-dependent speaker verification

TLDR
A new type of end-to-end system for text-dependent speaker verification is presented, using a speaker-discriminative CNN to extract noise-robust frame-level features; the proposed attention model combines the speaker-discriminative information and the phonetic information to learn the weights.

Deep neural network-based speaker embeddings for end-to-end speaker verification

TLDR
It is shown that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates.

Deep Speaker: an End-to-End Neural Speaker Embedding System

TLDR
Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.

End-to-end text-dependent speaker verification

In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly…

Front-End Factor Analysis For Speaker Verification

  • Florin Curelaru
  • Computer Science
    2018 International Conference on Communications (COMM)
  • 2018
TLDR
This paper investigates which configuration and parameters lead to the best performance of an i-vector/PLDA based speaker verification system, and presents preliminary experiments in which utterances from the CSTR VCTK corpus were used alongside utterances from MIT-MDSVC to train the total variability covariance matrix and the underlying PLDA matrices.

Deep feature for text-dependent speaker verification

Generalized End-to-End Loss for Speaker Verification

TLDR
A new loss function called generalized end-to-end (GE2E) loss is proposed, which makes the training of speaker verification models more efficient than the previous tuple-based end-to-end (TE2E) loss function.
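
The GE2E idea can be sketched numerically: score every utterance embedding against every speaker centroid with a learned-scaled cosine similarity, then apply a softmax loss favoring each utterance's own speaker. This simplified sketch skips GE2E's exclude-self centroid refinement, and `w`, `b` are illustrative values for what are learned parameters in the paper:

```python
import numpy as np

def ge2e_loss(embs, w=10.0, b=-5.0):
    """Simplified sketch of the GE2E (generalized end-to-end) loss.

    embs : (S, M, d) array, S speakers with M utterances each.
    Each utterance should score high against its own speaker's
    centroid and low against all other centroids.
    """
    S, M, d = embs.shape
    embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
    centroids = embs.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
    # Scaled cosine similarity of every utterance to every centroid.
    sim = w * np.einsum("smd,kd->smk", embs, centroids) + b
    # Softmax cross-entropy: negative log-probability of the own speaker.
    sim -= sim.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    own = log_probs[np.arange(S)[:, None],
                    np.arange(M)[None, :],
                    np.arange(S)[:, None]]
    return -own.mean()
```

When each speaker's utterances cluster tightly around a distinct direction, the loss approaches zero, which matches the loss's training objective.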

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

TLDR
This paper proposes an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections, and argues that CNNs have the capability to model temporal correlations with appropriate context information.

Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.