• Corpus ID: 237635068

Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

  • Xuechen Liu, Md. Sahidullah, Tomi H. Kinnunen
After their introduction in robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted for other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated to environmental compensation may be redundant. Further, these steps might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural…


Tiny, always-on and fragile: Bias propagation through design choices in on-device machine learning workflows
Billions of distributed, heterogeneous and resource constrained smart consumer devices deploy on-device machine learning (ML) to deliver private, fast and offline inference on personal data.


A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
This work provides an extensive re-assessment of 14 feature extractors on the VoxCeleb and SITW datasets, revealing that features equipped with techniques such as spectral centroids, the group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embedding extraction.
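One of the alternative descriptors named above, the spectral centroid, is simple to illustrate: it is the magnitude-weighted mean frequency of a frame's spectrum. A minimal sketch (not the paper's exact front-end; frame length and sample rate are illustrative assumptions):

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Spectral centroid of one audio frame: the magnitude-weighted
    mean frequency of its spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(frame))                 # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # bin center frequencies
    return float((freqs * mag).sum() / (mag.sum() + 1e-12))
```

For a pure tone the centroid recovers the tone's frequency; for speech frames it tracks the spectral balance that the compared front-ends exploit.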
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
  • Chanwoo Kim, R. Stern
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2016
Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing.
Attentive Statistics Pooling for Deep Speaker Embedding
Attentive statistics pooling for deep speaker embedding in text-independent speaker verification uses an attention mechanism to assign different weights to different frames, generating not only weighted means but also weighted standard deviations, which capture long-term variations in speaker characteristics more effectively.
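The weighted-statistics computation described above can be sketched as follows. In the actual model the attention scores come from a small learned network over the frames; here they are passed in directly as an illustrative assumption:

```python
import numpy as np

def attentive_stats_pooling(frames, scores):
    """Attentive statistics pooling over frame-level features.

    frames: (T, D) array of frame-level embeddings.
    scores: (T,) unnormalized attention scores (normally produced by a
            small learned network; supplied directly here for illustration).
    Returns the (2*D,) concatenation of weighted mean and weighted std.
    """
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over frames
    mean = weights @ frames                        # attention-weighted mean
    var = weights @ (frames ** 2) - mean ** 2      # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-12, None))       # weighted standard deviation
    return np.concatenate([mean, std])
```

Concatenating the weighted standard deviation alongside the mean is what lets the pooled vector carry the long-term variability mentioned in the summary.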
Speaker Recognition from Raw Waveform with SincNet
This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
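The key idea, a band-pass FIR kernel parametrized only by its two learnable cutoff frequencies, can be sketched as below. This is a minimal NumPy illustration of the sinc parametrization (the real layer learns the cutoffs by backpropagation; window choice and sizes here are assumptions):

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_len, sr):
    """Band-pass FIR kernel defined only by its two cutoff frequencies,
    in the spirit of SincNet's first convolutional layer.

    The difference of two ideal low-pass sinc responses yields a
    band-pass response; a Hamming window smooths the truncation.
    """
    n = np.arange(-(kernel_len // 2), kernel_len // 2 + 1)

    def lowpass(fc):
        # Ideal low-pass impulse response with unit DC gain.
        return (2 * fc / sr) * np.sinc(2 * fc * n / sr)

    kernel = lowpass(f_high) - lowpass(f_low)  # band-pass = difference of low-passes
    return kernel * np.hamming(len(kernel))    # window to reduce spectral ripple
```

Because each filter is fully described by two scalars, the first layer has far fewer parameters than a free convolutional layer while being constrained to meaningful band-pass shapes.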
Robust speaker recognition based on multi-stream features
  • Ning Wang, Lei Wang
  • Computer Science
    2016 IEEE International Conference on Consumer Electronics-China (ICCE-China)
  • 2016
A new method for improving the performance of an i-vector based speaker recognition system is proposed, combining PNCC with a modified SCF speech feature to improve robustness under codec mismatch.
Filterbank Design for End-to-end Speech Separation
The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet, validating the use of parameterized filterbanks and showing that complex-valued representations and masks are beneficial in all conditions.
Per-Channel Energy Normalization: Why and How
This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, from both theoretical and practical standpoints, and describes the asymptotic regimes of PCEN: temporal integration, gain control, and dynamic range compression.
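The three regimes named in that summary map directly onto the three stages of the PCEN transform. A minimal sketch on a (frequency, time) spectrogram, with parameter defaults that are typical in the PCEN literature but should be treated as assumptions here:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (frequency, time) spectrogram E.

    The three regimes appear explicitly:
    - temporal integration: first-order IIR smoother M,
    - gain control: division by (eps + M) ** alpha,
    - dynamic range compression: the (. + delta) ** r - delta ** r root.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]  # temporal integration
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because each channel is divided by its own smoothed energy, stationary level differences (e.g. channel gain, distance to the microphone) are largely cancelled, which is what makes PCEN attractive for far-field recordings.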
Deep Speaker: an End-to-End Neural Speaker Embedding System
Results are presented suggesting that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Speaker Recognition for Multi-speaker Conversations Using X-vectors
It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
A study on data augmentation of reverberant speech for robust speech recognition
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.