Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition


We examine the use of Deep Neural Networks (DNNs) in extracting Baum-Welch statistics for i-vector-based text-independent speaker recognition. Instead of training the universal background model with the standard EM algorithm, the components are predefined and correspond to the set of triphone states, whose posterior occupancy probabilities are modeled by a DNN. Those assignments are then combined with the standard 60-dimensional MFCC features to calculate first-order Baum-Welch statistics, which are used to train the i-vector extractor and extract i-vectors. The DNN-based assignments force the i-vectors to capture the idiosyncratic way in which each speaker pronounces each particular triphone state, which can enrich the short-term spectral representation of the standard i-vectors. In experiments with Switchboard data and a baseline PLDA classifier, the proposed i-vectors yield inferior performance compared to the standard ones, but they attain a 16% relative improvement when fused with them, meaning that they carry useful complementary information about the speaker’s identity. A further experiment with a different DNN configuration attained performance comparable to the baseline i-vectors on NIST 2012 (condition C2, female).
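The way frame-level posteriors and acoustic features combine into Baum-Welch statistics can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `baum_welch_stats`, the toy dimensions, and the random softmax posteriors standing in for DNN outputs are all assumptions made for illustration. The sketch uses the usual definitions, a zero-order statistic N_c = sum_t gamma_t(c) and a first-order statistic F_c = sum_t gamma_t(c) x_t, where gamma_t(c) is the posterior of component (here, triphone state) c at frame t and x_t is the feature vector.

```python
import numpy as np


def baum_welch_stats(posteriors, features):
    """Accumulate zero- and first-order statistics for one utterance.

    posteriors: (T, C) array, per-frame posterior of each component
                (assumed here to come from a DNN over triphone states).
    features:   (T, D) array of acoustic features (e.g. 60-dim MFCCs).
    Returns (N, F) with N of shape (C,) and F of shape (C, D).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    if posteriors.shape[0] != features.shape[0]:
        raise ValueError("posteriors and features must have the same number of frames")
    zero_order = posteriors.sum(axis=0)      # N_c = sum_t gamma_t(c)
    first_order = posteriors.T @ features    # F_c = sum_t gamma_t(c) * x_t
    return zero_order, first_order


# Toy data: random softmax outputs stand in for DNN posteriors (hypothetical).
rng = np.random.default_rng(0)
T, C, D = 200, 8, 60  # frames, components, feature dimension (illustrative sizes)
logits = rng.normal(size=(T, C))
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)
feats = rng.normal(size=(T, D))

N, F = baum_welch_stats(post, feats)
print(N.shape, F.shape)
```

Because each frame's posteriors sum to one, the zero-order statistics sum to the number of frames, which is a convenient sanity check. In a real system the number of components would be the number of triphone states, far larger than the 8 used here.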

Cite this paper

@inproceedings{Kenny2014DeepNN, title={Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition}, author={P. Kenny and V. Gupta and T. Stafylakis and P. Ouellet and J. Alam}, year={2014} }