• Publications
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
The Kaldi Speech Recognition Toolkit
The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
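The additive-noise half of that augmentation can be sketched as mixing a noise waveform into the speech at a target signal-to-noise ratio; `add_noise` and the SNR convention below are illustrative assumptions, not the paper's actual recipe:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at a target SNR (in dB).

    Both inputs are 1-D float waveforms of the same length.
    The noise is scaled so that
    speech_power / scaled_noise_power == 10**(snr_db / 10).
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

In an augmentation pipeline each training utterance would be duplicated with different noise segments and SNRs drawn at random, multiplying the effective amount of training data.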
The HTK book version 3.4
A time delay neural network architecture for efficient modeling of long temporal contexts
This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
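A minimal numpy sketch of one such layer follows: frames at a small set of temporal offsets are spliced together, passed through an affine transform plus nonlinearity, and outputs are computed only every `stride` frames (the sub-sampling). The function name, the ReLU choice, and the layout are assumptions for illustration, not Kaldi's implementation:

```python
import numpy as np

def tdnn_layer(x, w, b, offsets, stride=1):
    """One TDNN layer with sub-sampling.

    x: (T, d_in) input frames
    w: (len(offsets) * d_in, d_out) affine weights
    b: (d_out,) bias
    offsets: temporal context, e.g. [-1, 0, 1]
    stride: compute outputs only every `stride` frames
    """
    T, _ = x.shape
    start = max(-min(offsets), 0)          # first frame with full left context
    stop = T - max(offsets)                # last frame with full right context
    out = []
    for t in range(start, stop, stride):
        spliced = np.concatenate([x[t + o] for o in offsets])
        out.append(np.maximum(spliced @ w + b, 0.0))  # affine + ReLU
    return np.stack(out)
```

Because deeper layers see wider offsets, sub-sampling at intermediate layers avoids recomputing largely overlapping contexts, which is where the training-time savings come from.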
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
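The underlying MMI objective can be written in its standard form (the symbols below are the conventional ones, not quoted from the paper): for utterances $u$ with acoustics $\mathbf{O}_u$ and reference transcript $W_u$,

```latex
\mathcal{F}_{\mathrm{MMI}}(\lambda) =
\sum_{u} \log
\frac{p_\lambda(\mathbf{O}_u \mid \mathcal{M}_{W_u})\, P(W_u)}
     {\sum_{W} p_\lambda(\mathbf{O}_u \mid \mathcal{M}_{W})\, P(W)}
```

In the lattice-free variant, the denominator sum is computed exactly over a phone-level denominator graph rather than approximated with word lattices from a first decoding pass.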
Minimum Phone Error and I-smoothing for improved discriminative training
The Minimum Phone Error (MPE) and Minimum Word Error (MWE) criteria are smoothed approximations to the phone and word error rates respectively; I-smoothing is a novel technique for smoothing discriminative training criteria using statistics from maximum likelihood estimation (MLE).
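One standard way to present I-smoothing (a textbook sketch, not quoted from the paper) is as adding $\tau$ "virtual" points of ML statistics to each Gaussian's numerator occupancy $\gamma_{\mathrm{num}}$ and first- and second-order statistics $\mathbf{x}_{\mathrm{num}}, \mathbf{S}_{\mathrm{num}}$ before the parameter update:

```latex
\hat{\gamma}_{\mathrm{num}} = \gamma_{\mathrm{num}} + \tau, \qquad
\hat{\mathbf{x}}_{\mathrm{num}} = \mathbf{x}_{\mathrm{num}} + \tau\,\boldsymbol{\mu}_{\mathrm{ML}}, \qquad
\hat{\mathbf{S}}_{\mathrm{num}} = \mathbf{S}_{\mathrm{num}}
  + \tau\left(\boldsymbol{\Sigma}_{\mathrm{ML}} + \boldsymbol{\mu}_{\mathrm{ML}}\boldsymbol{\mu}_{\mathrm{ML}}^{\top}\right)
```

This backs the discriminative estimate off toward the ML estimate when a Gaussian has little training data, controlled by the single constant $\tau$.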
MUSAN: A Music, Speech, and Noise Corpus
This report introduces a new corpus of music, speech, and noise suitable for training models for voice activity detection (VAD) and music/speech discrimination and demonstrates use of this corpus on Broadcast news and VAD for speaker identification.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Boosted MMI for model and feature-space discriminative training
A modified form of the maximum mutual information (MMI) objective function is presented that gives improved results for discriminative training by boosting the likelihoods of paths in the denominator lattice that have a higher phone error rate relative to the correct transcript.
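The boosting enters as an extra factor in the denominator of the MMI objective; in the usual notation (a standard presentation, not quoted from the paper), with boosting factor $b \ge 0$, acoustic scale $\kappa$, and $A(W, W_u)$ a phone-accuracy measure of hypothesis $W$ against reference $W_u$:

```latex
\mathcal{F}_{\mathrm{bMMI}}(\lambda) =
\sum_{u} \log
\frac{p_\lambda(\mathbf{O}_u \mid \mathcal{M}_{W_u})^{\kappa}\, P(W_u)}
     {\sum_{W} p_\lambda(\mathbf{O}_u \mid \mathcal{M}_{W})^{\kappa}\, P(W)\, e^{-b\,A(W, W_u)}}
```

Setting $b = 0$ recovers plain MMI; $b > 0$ inflates the likelihood of competing paths in proportion to how wrong they are, creating a soft margin.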