
Parameterized Channel Normalization for Far-field Deep Speaker Verification

Xuechen Liu, Md. Sahidullah, Tomi H. Kinnunen
We address far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor, where mismatch between enrollment and test data often comes from convolutive effects (e.g. room reverberation) and noise. To mitigate these effects, we focus on two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN). Both methods contain differentiable parameters and thus can be conveniently integrated to, and…

Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment
A multi-channel training framework is proposed for the deep speaker embedding neural network on noisy and reverberant data, and the proposed method is shown to obtain significant improvements over the single-channel trained deep speaker embedding system with front-end speech enhancement or multi-channel embedding fusion.
HI-MIA: A Far-Field Text-Dependent Speaker Verification Database and the Baselines
  • Xiaoyi Qin, Hui Bu, Ming Li
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A far-field text-dependent speaker verification database named HI-MIA is presented and a set of end-to-end neural network based baseline systems that adopt single-channel data for training are proposed.
Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments
It is shown that rank-1 approximation of a speech covariance matrix based on generalized eigenvalue decomposition leads to the best results for the masking-based MVDR beamformer.
Parametric Cepstral Mean Normalization for Robust Speech Recognition
Experimental results show that, in contrast to traditional CMN, which degrades performance on clean data, PCMN provides a 5% relative improvement on clean data, while also providing an 11.2% relative improvement on far-field test data.
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
The database, the challenge, and the baseline system are described. The baseline is a ResNet-based deep speaker network with cosine similarity scoring, which achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.
Trainable frontend for robust and far-field keyword spotting
This work introduces a novel frontend called per-channel energy normalization (PCEN), which uses an automatic gain control based dynamic compression to replace the widely used static compression in speech recognition.
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
This work provides extensive re-assessment of 14 feature extractors on VoxCeleb and SITW datasets to reveal that features equipped with techniques such as spectral centroids, group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embeddings extraction.
Per-Channel Energy Normalization: Why and How
This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, both from theoretical and practical standpoints and describes the asymptotic regimes in PCEN: temporal integration, gain control, and dynamic range compression.
Utilizing VOiCES Dataset for Multichannel Speaker Verification with Beamforming
A multichannel dataset as well as development and evaluation trials for SV inspired by the VOiCES challenge are designed, and the utilization of the created dataset for x-vector based SV with beamforming as a front end is assessed.
Far-Field End-to-End Text-Dependent Speaker Verification Based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation
It is shown that simulating far-field text-independent data from the existing large-scale clean database for data augmentation can reduce the mismatch, and that using a small far-field text-dependent data set to fine-tune the deep speaker embedding model pre-trained on the simulated far-field as well as the original clean text-independent data can significantly improve the system performance.
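
Several entries above refer to per-channel energy normalization (PCEN) and its three stages: temporal integration, gain control, and dynamic range compression. The following is a minimal NumPy sketch of the commonly cited PCEN formulation, not the implementation used in any of the listed papers. The function name `pcen`, the default parameter values, and the choice to initialise the smoother with the first frame are assumptions made for illustration. In the trainable variants the parameters would be learned jointly with the network, whereas here they are fixed.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Illustrative PCEN on a (frames, channels) array of non-negative energies.

    Parameter names and defaults are assumptions for this sketch.
    """
    energy = np.asarray(energy, dtype=float)

    # Temporal integration: first-order IIR smoothing of each channel over time.
    smoothed = np.empty_like(energy)
    smoothed[0] = energy[0]  # assumption: start the smoother at the first frame
    for t in range(1, energy.shape[0]):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]

    # Gain control: divide each energy by its smoothed version raised to alpha.
    gain_controlled = energy / (eps + smoothed) ** alpha

    # Dynamic range compression: offset by delta, then apply a root with exponent r.
    return (gain_controlled + delta) ** r - delta ** r
```

With `alpha` set to 1, the gain-control stage makes the output close to invariant to a constant scaling of the input, which is one way to see why such a front end can help with channel and loudness mismatch.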