• Publications
A time delay neural network architecture for efficient modeling of long temporal contexts
This paper proposes a time delay neural network (TDNN) architecture that models long-term temporal dependencies with training times comparable to standard feed-forward DNNs, using sub-sampling to reduce computation during training.
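The sub-sampled TDNN described above can be viewed as a stack of dilated 1-D affine layers over spliced frame contexts. A minimal numpy sketch, assuming a toy `tdnn_layer` helper (hypothetical, not the paper's implementation) that splices frames at offsets {-d, 0, +d}:

```python
import numpy as np

def tdnn_layer(x, w, dilation=1):
    """One toy TDNN layer: splice frames at offsets {-dilation, 0, +dilation}
    and apply an affine transform with ReLU. x: (T, D_in); w: (3*D_in, D_out)."""
    T, _ = x.shape
    pad = np.pad(x, ((dilation, dilation), (0, 0)), mode="edge")
    # Spliced context for each frame t: [x[t-d], x[t], x[t+d]]
    spliced = np.concatenate(
        [pad[:T], pad[dilation:dilation + T], pad[2 * dilation:2 * dilation + T]],
        axis=1,
    )
    return np.maximum(spliced @ w, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 40))                       # 100 frames, 40-dim features
h1 = tdnn_layer(x, 0.1 * rng.standard_normal((120, 64)), dilation=1)
h2 = tdnn_layer(h1, 0.1 * rng.standard_normal((192, 64)), dilation=3)
print(h2.shape)  # (100, 64)
```

Stacking layers with growing dilation widens the receptive field (here frames t-4..t+4) while each layer only touches three spliced offsets, which is the source of the training-time savings.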
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
A method is described for sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training, using the lattice-free version of the maximum mutual information criterion (LF-MMI).
Audio augmentation for speech recognition
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
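A common audio-level augmentation of the kind investigated here is speed perturbation, which resamples the raw waveform by a small factor. A minimal sketch, assuming a hypothetical `speed_perturb` helper built on linear interpolation (real systems typically use a proper resampler):

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample the waveform to simulate a speed change:
    factor 0.9 slows the signal down, 1.1 speeds it up."""
    n_out = int(round(len(signal) / factor))
    src = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(src, np.arange(len(signal)), signal)

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)   # 1 s of a 440 Hz tone
slow = speed_perturb(audio, 0.9)      # ~1.11 s
fast = speed_perturb(audio, 1.1)      # ~0.91 s
print(len(slow), len(fast))
```

Each perturbed copy changes both duration and pitch, effectively multiplying the amount of training data seen by the acoustic model.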
Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs
Bidirectional long short-term memory (BLSTM) acoustic models provide a significant word error rate reduction compared to their unidirectional counterpart, as they model both the past and future
A study on data augmentation of reverberant speech for robust speech recognition
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
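The augmentation recipe summarized above amounts to convolving clean speech with a room impulse response (RIR) and mixing in a point-source noise at a target SNR. A minimal sketch with a hypothetical `reverberate` helper and toy signals (not the paper's actual pipeline):

```python
import numpy as np

def reverberate(speech, rir, noise=None, snr_db=15.0):
    """Convolve speech with an RIR, then add a point-source noise
    scaled to the requested SNR (toy helper for illustration)."""
    rev = np.convolve(speech, rir)[: len(speech)]
    if noise is not None:
        noise = noise[: len(rev)]
        sig_pow = np.mean(rev ** 2)
        noi_pow = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(sig_pow / (noi_pow * 10 ** (snr_db / 10)))
        rev = rev + scale * noise
    return rev

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                               # 1 s toy signal
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy decaying RIR
noise = rng.standard_normal(16000)                                # point-source noise
out = reverberate(speech, rir, noise, snr_db=15.0)
print(out.shape)
```

Sweeping over many simulated RIRs and noise types at varying SNRs is what lets a single clean corpus cover a wide range of acoustic conditions.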
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs
This paper tackles the problem of reverberant speech recognition using 5500 hours of simulated reverberant data and a time-delay neural network (TDNN) architecture, which is capable of modeling long-term interactions between speech and corrupting sources in reverberant environments.
Evaluating speech features with the minimal-pair ABX task: analysis of the classical MFC/PLP pipeline
A new framework for the evaluation of speech representations in zero-resource settings is presented, that extends and complements previous work by Carlin, Jansen and Hermansky and applies it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients.
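The minimal-pair ABX task named above scores a representation by asking, for triples (A, B, X) with A and X from one category and B from another, how often X lands closer to B than to A. A toy sketch with Euclidean distance and synthetic vectors (the paper's evaluation uses real speech features and frame-level distances such as DTW):

```python
import numpy as np

def abx_error(a_set, b_set, dist):
    """Minimal-pair ABX error: fraction of (A, B, X) triples, with A and X
    drawn from category a and B from category b, where X is closer to B."""
    errors, total = 0, 0
    for i, a in enumerate(a_set):
        for j, x in enumerate(a_set):
            if i == j:
                continue          # X must differ from A
            for b in b_set:
                errors += dist(x, b) < dist(x, a)
                total += 1
    return errors / total

rng = np.random.default_rng(0)
cat_a = rng.standard_normal((5, 13)) + 5.0   # toy 13-dim vectors, category a
cat_b = rng.standard_normal((5, 13)) - 5.0   # category b, well separated
euc = lambda u, v: np.linalg.norm(u - v)
err = abx_error(cat_a, cat_b, euc)
print(err)
```

An error of 0.5 means the representation carries no category information; 0.0 means the categories are perfectly separated under the chosen distance.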
An Exploration of Dropout with LSTMs
This paper describes extensive experiments investigating the best way to combine dropout with LSTMs, specifically projected LSTMs (LSTMP), giving consistent improvements in WER across a range of datasets, including Switchboard, TED-LIUM and AMI.
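The basic building block combined with LSTMP layers in experiments like these is inverted dropout, which zeroes a random subset of units and rescales the rest so the expected activation is unchanged. A minimal sketch (a generic dropout helper, not the paper's specific placement within the LSTMP cell):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`,
    scale survivors by 1/(1-rate) so the expectation is preserved."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((4, 8))          # e.g. an LSTMP projected output
d = dropout(h, 0.2, rng)     # surviving units become 1 / 0.8 = 1.25
print(d.shape)
```

Where the mask is applied (on inputs, recurrent connections, or the projected output) and whether it is shared across time steps are exactly the design choices such studies compare.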
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition
Centered around the tasks of phonetic and lexical discovery, unified evaluation metrics are considered, two new approaches for improving speaker independence in the absence of supervision are presented, and the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations is evaluated.