CNN-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances
The i-vector has been shown to be very effective for speaker verification with long-duration speech utterances. When test utterances are short, however, content mismatch between the enrollment and test utterances limits the performance of the i-vector system. This paper proposes extracting local session variability vectors for different phonetic classes, instead of estimating session variability across the whole utterance as the i-vector does. Using the posteriors given by a deep neural network (DNN) trained for phone state classification, the local vectors represent the session variability contained in specific phonetic content. Our experiments show that the content-aware local vectors cope better with the content mismatch between training and test utterances of short duration in text-independent, text-constrained and text-dependent tasks.
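As a rough illustration of how DNN posteriors can localize statistics to phonetic content, the following minimal sketch pools phone-state posteriors into phonetic classes and accumulates posterior-weighted zeroth- and first-order statistics per class. It covers only this accumulation step, not the estimation of the local vectors themselves. The function names, the `state_to_class` grouping, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def class_statistics(features, posteriors, state_to_class, num_classes):
    """Accumulate posterior-weighted statistics per phonetic class.

    features:       (T, D) acoustic frames
    posteriors:     (T, S) DNN phone-state posteriors (each row sums to 1)
    state_to_class: (S,) class index of each phone state (hypothetical grouping)
    """
    features = np.asarray(features, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    state_to_class = np.asarray(state_to_class)
    # One-hot (S, C) matrix that sums state posteriors into class posteriors.
    pooling = np.eye(num_classes)[state_to_class]
    class_post = posteriors @ pooling      # (T, C)
    zeroth = class_post.sum(axis=0)        # (C,)  soft frame count per class
    first = class_post.T @ features        # (C, D) weighted feature sums
    return zeroth, first

def class_means(zeroth, first, eps=1e-8):
    """Posterior-weighted mean feature per class (eps guards empty classes)."""
    return first / np.maximum(zeroth, eps)[:, None]

# Toy example: 3 frames, 2-dim features, 3 phone states grouped into 2 classes.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
posteriors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
state_to_class = [0, 0, 1]
zeroth, first = class_statistics(features, posteriors, state_to_class, num_classes=2)
means = class_means(zeroth, first)
```

In this toy case the first two frames fall entirely in class 0 and the third in class 1, so each class receives statistics only from frames carrying its own phonetic content. A short test utterance that lacks some classes would simply yield near-zero counts for them.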