MFA: TDNN with Multi-Scale Frequency-Channel Attention for Text-Independent Speaker Verification with Short Utterances

@article{Liu2022MFATW,
  title={MFA: TDNN with Multi-Scale Frequency-Channel Attention for Text-Independent Speaker Verification with Short Utterances},
  author={Tianchi Liu and Rohan Kumar Das and Kong-Aik Lee and Haizhou Li},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022},
  pages={7517-7521}
}
  • Tianchi Liu, Rohan Kumar Das, Kong-Aik Lee, Haizhou Li
  • Published 3 February 2022
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification. However, such networks require a large number of filters to capture speaker characteristics at any local frequency region. In addition, their performance may degrade under short-utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel…
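The abstract is truncated, but its central mechanism, squeeze-and-excitation-style attention applied over both the channel and frequency axes of a 2D feature map, can be sketched as follows. This is a minimal illustrative reconstruction in PyTorch, not the authors' released MFA implementation; the module name, reduction factor, and input layout (batch, channels, frequency, time) are all assumptions.

# Illustrative frequency-channel attention block (an assumption, not the paper's exact module).
import torch
import torch.nn as nn

class FreqChannelAttention(nn.Module):
    """Squeeze-excitation-style gating over the channel and frequency axes."""
    def __init__(self, channels: int, freq_bins: int, reduction: int = 8):
        super().__init__()
        # Channel branch: pool over (frequency, time), re-weight each channel.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # Frequency branch: pool over (channel, time), re-weight each frequency bin.
        self.freq_fc = nn.Sequential(
            nn.Linear(freq_bins, freq_bins // reduction), nn.ReLU(),
            nn.Linear(freq_bins // reduction, freq_bins), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        w_c = self.channel_fc(x.mean(dim=(2, 3)))  # (b, c) channel weights
        w_f = self.freq_fc(x.mean(dim=(1, 3)))     # (b, f) frequency weights
        return x * w_c.view(b, c, 1, 1) * w_f.view(b, 1, f, 1)

att = FreqChannelAttention(channels=32, freq_bins=80)
out = att(torch.randn(4, 32, 80, 200))  # output keeps the input shape

In the paper's multi-scale setting, such a block would be applied to feature maps at several resolutions of the network; only a single scale is shown here.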

Citations

Investigation of feature processing modules and attention mechanisms in speaker verification system

This paper replaces and integrates different feature front-end and attention mechanism modules to compare them and identify the most effective model design, which is then adopted as the final system.

Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification

The proposed C2D-Att is effective in generating discriminative attention maps, outperforms other attention methods, shows robust performance across different model sizes, and achieves state-of-the-art results.

Selective Kernel Attention for Robust Speaker Verification

This work proposes three module variants using the SKA mechanism, whereby two modules are applied in front of an ECAPA-TDNN model and the other is combined with the Res2Net backbone block, outperforming the conventional counterpart on three different evaluation protocols.

Generalization Ability Improvement of Speaker Representation and Anti-Interference for Speaker Verification

Two novel approaches are proposed: one improves generalization to mismatched recording scenarios and languages in test conditions, and the other reduces the influence of interference from other speakers on the similarity measurement between two speaker embeddings.

Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification

This study proposes two SKA variants, one applied in front of the ECAPA-TDNN model and the other combined with the Res2Net backbone block, and demonstrates that they consistently improve performance and are complementary when tested on three different evaluation protocols.

Attention enhanced dynamic kernel convolution for TDNN-based speaker verification

  • Xiaofan Lang, Ya Li
  • Computer Science
    Conference on Computer Science and Communication Technology
  • 2022
A dynamic kernel convolution module is proposed to extract features from short-term and long-term context adaptively, achieving multi-scale receptive fields, along with three enhanced attention modules that replace the plain Squeeze-Excitation layer to realize more efficient information interaction across channels and spatial dimensions.
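The one-sentence summary compresses a lot; the general idea of a dynamic kernel convolution, parallel branches with short and long receptive fields fused by learned selection weights, can be sketched as below. This is a generic selective-kernel-style sketch, not the paper's module; the branch kernel sizes, dilation, and reduction factor are assumptions.

# Generic dynamic (selective) kernel convolution sketch, in the spirit of SKNet,
# not the paper's exact module. Input shape: (batch, channels, time).
import torch
import torch.nn as nn

class DynamicKernelConv1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Two branches with different receptive fields: short- and long-term context.
        self.short = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.long = nn.Conv1d(channels, channels, kernel_size=3, padding=3, dilation=3)
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                nn.Linear(hidden, 2 * channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t = x.shape
        branches = torch.stack([self.short(x), self.long(x)], dim=1)  # (b, 2, c, t)
        # A global descriptor of the summed branches selects per-channel branch weights.
        desc = branches.sum(dim=1).mean(dim=2)                        # (b, c)
        gate = self.fc(desc).view(b, 2, c).softmax(dim=1)             # (b, 2, c)
        return (branches * gate.unsqueeze(-1)).sum(dim=1)             # (b, c, t)

The softmax over the branch axis makes the two receptive fields compete per channel, which is what lets the kernel adapt to the input rather than being fixed.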

Model Compression for DNN-Based Text-Independent Speaker Verification Using Weight Quantization

Weight quantization is exploited to compress DNN-based speaker embedding extraction models, and the results demonstrate that quantized models remain robust for SV in the language-mismatch scenario.

Speaker recognition with two-step multi-modal deep cleansing

A two-step audio-visual deep cleansing framework is proposed to eliminate the effect of noisy labels in speaker representation learning, with which four different speaker recognition networks achieve an average improvement of 5.9%.

BSML: Bidirectional Sampling Aggregation Based Metric Learning for Low-Resource Uyghur Few-Shot Speaker Verification

  • Yunfei Zi, Shengwu Xiong
  • Computer Science
    ACM Transactions on Asian and Low-Resource Language Information Processing
  • 2022
The experimental results show that the metric learning approach is effective in avoiding model overfitting and improving generalization, with significant gains on short-duration, few-shot speaker verification in low-resource Uyghur.

The Clips System for Spoofing-Aware Speaker Verification Challenge 2022

The results show that the proposed fusion system can significantly improve the SASV equal error rate (SASV-EER) from 6.37% to 1.36% on the evaluation dataset and from 4.85% to 0.98% on the development dataset.

References

Showing 1-10 of 38 references

Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification

A 2D convolutional stem is introduced in a strong ECAPA-TDNN baseline to transfer some of the strong characteristics of a ResNet-based model to this hybrid CNN-TDNN architecture, and a frequency-wise variant of Squeeze-Excitation is proposed that better preserves frequency-specific information when rescaling the feature maps.

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and in the 2019 VoxCeleb Speaker Recognition Challenge.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.

Neural Acoustic-Phonetic Approach for Speaker Verification With Phonetic Attention Mask

This paper evaluates the proposed neural acoustic-phonetic framework on the RSR2015 database Part III corpus, which consists of random digit strings, and shows that the proposed framework with PAM consistently outperforms the baseline.

PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction

Compared to the baseline, the proposed end-to-end deep learning framework, dubbed PL-EESR, shows better performance in both clean and noisy environments, indicating that the method not only enhances speaker-relevant information but also avoids adding distortion.

Xi-Vector Embedding for Speaker Recognition

A Bayesian formulation for deep speaker embedding is presented, wherein the xi-vector is the Bayesian counterpart of the x-vector that takes an uncertainty estimate into account, leading to substantial improvement across all operating points.

The ins and outs of speaker recognition: lessons from VoxSRC 2020

This work utilises variants of the popular ResNet architecture for speaker recognition and performs extensive experiments using a range of loss functions and training parameters to optimise an efficient training framework that allows powerful models to be trained with limited time and resources.

ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification

RET integrates shortcut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract more discriminative representations that capture long-term temporal features of speakers.

Dynamic Margin Softmax Loss for Speaker Verification

A dynamic-margin softmax loss is proposed for training deep speaker embedding neural networks; it dynamically sets the margin of each training sample commensurate with the cosine angle of that sample, hence the name dynamic additive margin softmax (DAM-Softmax) loss.
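As a rough illustration of the idea, the sketch below adds a per-sample margin that scales with the sample's own cosine score before the softmax cross-entropy. The exact margin schedule of DAM-Softmax is not reproduced here; the scale, base margin, and proportional-margin rule are assumptions.

# Hedged sketch of a dynamic additive-margin softmax loss. As an assumption,
# the margin grows with the sample's cosine score toward its own class.
import torch
import torch.nn.functional as F

def dam_softmax_loss(embeddings, weights, labels, scale=30.0, base_margin=0.2):
    """embeddings: (batch, dim); weights: (num_classes, dim); labels: (batch,)."""
    # Cosine similarity between each embedding and every class center.
    cos = F.normalize(embeddings) @ F.normalize(weights).t()    # (batch, classes)
    target_cos = cos.gather(1, labels.unsqueeze(1)).squeeze(1)  # (batch,)
    # Dynamic margin: commensurate with the sample's own cosine angle (assumption).
    margin = base_margin * target_cos.clamp(min=0.0)
    logits = cos.clone()
    logits[torch.arange(cos.size(0)), labels] = target_cos - margin
    return F.cross_entropy(scale * logits, labels)

Relative to the plain additive-margin softmax, the only change is that the constant margin is replaced by a per-sample value derived from the sample's cosine score.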

Speaker-Utterance Dual Attention for Speaker and Utterance Verification

A novel technique is proposed that exploits the interaction between speaker traits and linguistic content to improve both speaker verification and utterance verification performance, implemented in a unified neural network.