Corpus ID: 237635068

Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

  • Xuechen Liu, Md. Sahidullah, Tomi H. Kinnunen
After their introduction in robust speech recognition, power-normalized cepstral coefficient (PNCC) features were successfully adopted for other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal-processing and amplitude-scaling steps dedicated to environmental compensation may be redundant. Further, they might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural…



A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
This work provides an extensive re-assessment of 14 feature extractors on the VoxCeleb and SITW datasets, revealing that features equipped with techniques such as spectral centroids, the group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embedding extraction.
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
  • Chanwoo Kim, R. Stern
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016
Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing.
Attentive Statistics Pooling for Deep Speaker Embedding
Attentive statistics pooling for deep speaker embedding in text-independent speaker verification uses an attention mechanism to give different weights to different frames, and generates not only weighted means but also weighted standard deviations, which capture long-term variations in speaker characteristics more effectively.
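The weighted first- and second-order statistics described in this summary can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the small network that produces the per-frame attention scores is omitted, and the function and argument names are hypothetical.

```python
import numpy as np

def attentive_stats_pooling(frames, scores):
    """Pool frame-level features into a fixed-length utterance-level vector.

    frames: (T, D) array of frame-level features.
    scores: (T,) unnormalized attention scores (one per frame).
    Returns the concatenated weighted mean and weighted standard
    deviation, shape (2 * D,).
    """
    # Softmax over time turns scores into attention weights summing to 1.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    mu = (w[:, None] * frames).sum(axis=0)               # weighted mean
    var = (w[:, None] * (frames - mu) ** 2).sum(axis=0)  # weighted variance
    sigma = np.sqrt(np.maximum(var, 1e-12))              # weighted std
    return np.concatenate([mu, sigma])
```

With uniform scores this reduces to ordinary mean-and-std pooling; non-uniform scores let informative frames dominate both statistics.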
Speaker Recognition from Raw Waveform with SincNet
This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions that implement band-pass filters.
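The band-pass construction behind those filters can be sketched as below: the difference of two ideal low-pass (sinc) kernels with cutoffs f1 < f2 yields a band-pass kernel, smoothed with a Hamming window. This is an illustrative sketch of the filter shape only; in SincNet the cutoffs are learned by backpropagation, which is not shown here, and the function name is hypothetical.

```python
import numpy as np

def sinc_bandpass(f1, f2, length, fs):
    """Time-domain band-pass kernel from two cutoff frequencies.

    f1, f2: lower/upper cutoffs in Hz (f1 < f2).
    length: odd filter length in samples.
    fs: sampling rate in Hz.
    """
    n = np.arange(-(length // 2), length // 2 + 1) / fs  # time axis (s)
    # np.sinc(x) = sin(pi*x)/(pi*x), so 2f*sinc(2f*t) is an ideal
    # low-pass kernel with cutoff f.
    low = 2 * f1 * np.sinc(2 * f1 * n)
    high = 2 * f2 * np.sinc(2 * f2 * n)
    # Difference of two low-pass kernels = band-pass; window to reduce ripple.
    return (high - low) * np.hamming(length)
```

At the center tap (t = 0) both sinc terms equal 1, so the kernel peaks at 2·(f2 − f1), and only the two cutoffs parametrize the whole filter.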
Robust speaker recognition based on multi-stream features
  • Ning Wang, Lei Wang
  • 2016 IEEE International Conference on Consumer Electronics-China (ICCE-China)
A new method for improving the performance of an i-vector based speaker recognition system by combining PNCC with a modified SCF speech feature is proposed to improve robustness under codec mismatch.
Filterbank Design for End-to-end Speech Separation
The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet, validate the use of parameterized filterbanks, and show that complex-valued representations and masks are beneficial in all conditions.
Per-Channel Energy Normalization: Why and How
This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, from both theoretical and practical standpoints, and describes the asymptotic regimes of PCEN: temporal integration, gain control, and dynamic range compression.
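The three regimes named in that summary map directly onto the PCEN transform: a first-order IIR smoother performs temporal integration, division by the smoothed energy raised to a power performs adaptive gain control, and the final root-with-offset step applies dynamic range compression. A minimal NumPy sketch, with parameter defaults chosen for illustration rather than taken from the letter:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a spectrogram E (freq, time)."""
    # Temporal integration: first-order IIR smoothing along time.
    M = np.zeros_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    # Gain control (divide by M**alpha) + dynamic range compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because each channel is divided by its own smoothed energy, slowly varying gains (e.g. loudness changes in far-field recordings) are normalized away per channel.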
Deep Speaker: an End-to-End Neural Speaker Embedding System
Results suggest that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Speaker Recognition for Multi-speaker Conversations Using X-vectors
It is found that diarization substantially reduces the error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.
A study on data augmentation of reverberant speech for robust speech recognition
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.