DT-SV: A Transformer-based Time-domain Approach for Speaker Verification

Nan Zhang, Jianzong Wang, Zhenhou Hong, Chendong Zhao, Xiaoyang Qu, Jing Xiao
2022 International Joint Conference on Neural Networks (IJCNN)

Speaker verification (SV) aims to determine whether the speaker of a test utterance is the same as the speaker of the reference speech. In the past few years, extracting speaker embeddings with deep neural networks has become the mainstream approach for SV systems. Recently, various attention mechanisms and Transformer networks have been widely explored in the SV field. However, directly applying the original Transformer to SV may waste frame-level information in the output features, which could lead to…
1 Citation

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

This work integrates sparse attention and monotonic attention into Transformer-based ASR, and shows that the method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.



S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

The speaker embeddings obtained from the proposed speaker classification model are referred to as s-vectors to emphasize that they come from an architecture that relies heavily on self-attention. The paper demonstrates that s-vectors with TESA perform better than s-vectors with conventional PLDA-based speaker verification.

Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification

This paper brings the concept of neural architecture search (NAS) to the well-known x-vector network and proposes an evolutionary-algorithm-enhanced NAS method called Auto-Vector to automatically discover promising networks for the speaker verification task.

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset

  • Yue Fan, Jiawen Kang, Dong Wang
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper presents CN-Celeb, a large-scale speaker recognition dataset collected ‘in the wild’ that contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different genres in the real world.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes Transformer-XL, a novel neural architecture that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This work introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

VoxCeleb2: Deep Speaker Recognition

A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced, and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.

ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

This paper proposes ICSpk, a complex speaker embedding extractor with higher interpretability and fewer parameters, and demonstrates that the system based on IC filters shows signs of improvement over systems based on complex spectrograms.

CACnet: Cube Attentional CNN for Automatic Speech Recognition

This paper proposes a Cube Attention CNN that uses two different attention blocks to integrate feature information across different dimensions and extend context information, achieving competitive accuracy with fewer parameters.

Federated Learning with Dynamic Transformer for Text to Speech

This paper proposes the federated dynamic transformer, a practical and secure framework that lets data owners collaborate with others to obtain a better global model trained on the larger dataset, and that achieves faster and more stable convergence in the training phase.