Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using …
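The policy-gradient idea this abstract refers to can be sketched in toy form: sample from the model, score the sample with the non-differentiable metric, and follow the score-function (REINFORCE) gradient of the log-probability weighted by the reward. This is a minimal sketch with a made-up reward table, not the paper's captioning system; all names here are illustrative.

```python
import numpy as np

# Toy REINFORCE sketch: maximize the expected value of a reward that is
# not differentiable in the model parameters, using only log-prob gradients.
rng = np.random.default_rng(0)
n_actions = 4
theta = np.zeros(n_actions)              # logits of a categorical "policy"
reward = np.array([0.0, 0.2, 1.0, 0.1])  # stand-in for a task metric (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(n_actions, p=p)       # sample an action (e.g. a caption)
    r = reward[a]
    baseline = p @ reward                # variance-reducing baseline
    grad_logp = -p
    grad_logp[a] += 1.0                  # d log p(a) / d theta for a softmax
    theta += 0.1 * (r - baseline) * grad_logp

# The policy mass should concentrate on the highest-reward action (index 2).
```

The update is unbiased for the gradient of the expected reward, which is what lets the training signal be any metric, differentiable or not.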
This paper investigates data augmentation for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity. Two data augmentation approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for both deep neural networks (DNNs) and convolutional neural networks …
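The VTLP transform mentioned here is commonly implemented as a piecewise-linear warp of the frequency axis: a random factor rescales frequencies below a cutoff, and the remainder is interpolated so the Nyquist frequency maps to itself. A hedged sketch of that warp (parameter names and defaults are illustrative, not taken from the paper):

```python
import numpy as np

def vtlp_warp(freqs, alpha, f_hi=4800.0, f_nyq=8000.0):
    """Piecewise-linear VTLP frequency warp with factor alpha."""
    freqs = np.asarray(freqs, dtype=float)
    f0 = f_hi * min(alpha, 1.0) / alpha   # boundary of the linearly scaled region
    low = freqs <= f0
    warped = np.empty_like(freqs)
    warped[low] = alpha * freqs[low]
    # Above the boundary, interpolate linearly so f_nyq is a fixed point.
    warped[~low] = f_nyq - (f_nyq - alpha * f0) * (f_nyq - freqs[~low]) / (f_nyq - f0)
    return warped

# Typical use: warp mel filterbank centre frequencies with alpha drawn
# from a small range, e.g. [0.9, 1.1], once per utterance.
```

Because the warp is label-preserving, each utterance can be replayed under several warp factors, multiplying the effective training data.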
A standard approach to automatic speech recognition uses hidden Markov models whose state dependent distributions are Gaussian mixture models. Each Gaussian can be viewed as an exponential model whose features are linear and quadratic monomials in the acoustic vector. We consider here models in which the weight vectors of these exponential models are …
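The "linear and quadratic monomials" view can be written out explicitly: the Gaussian log-density is an exponential model whose feature vector stacks the components of $x$ and their pairwise products. A sketch of the identity:

```latex
\log \mathcal{N}(x;\mu,\Sigma)
  = \underbrace{(\Sigma^{-1}\mu)^{\top} x}_{\text{linear features } x_i}
  \;-\; \underbrace{\tfrac{1}{2}\, x^{\top}\Sigma^{-1}x}_{\text{quadratic features } x_i x_j}
  \;-\; \log Z(\mu,\Sigma),
\qquad
\log Z(\mu,\Sigma) = \tfrac{1}{2}\mu^{\top}\Sigma^{-1}\mu
  + \tfrac{1}{2}\log\!\left((2\pi)^{n}\,|\Sigma|\right).
```

Constraining the weight vector $(\Sigma^{-1}\mu,\ \Sigma^{-1})$ of this exponential model, rather than $(\mu, \Sigma)$ directly, is what gives models of this family their flexibility.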
Linear transforms are often used for adaptation to test data in speech recognition systems. However, when used with small amounts of test data, these techniques provide limited improvements, if any. This paper proposes a two-step Bayesian approach where a) the transforms lie in a subspace obtained at training time and b) the expansion coefficients of the …
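The two-step structure described here can be summarized: the adaptation transform is constrained to a subspace spanned by basis transforms estimated on training data, and only the expansion coefficients are estimated from the (small) test set, regularized by a prior. A sketch with assumed notation, not taken from the paper:

```latex
W \;=\; \sum_{i=1}^{k} c_i\, W_i,
\qquad
\hat{c} \;=\; \arg\max_{c}\; p\!\left(X \,\middle|\, W(c)\right)\, p(c),
```

where the $W_i$ are the training-time basis transforms and $p(c)$ is the prior on the coefficients. Because $k$ is much smaller than the number of free parameters of a full transform, the estimate stays well-conditioned even with seconds of adaptation data.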
In this paper, we study discriminative training of acoustic models for speech recognition under two criteria: maximum mutual information (MMI) and a novel "error-weighted" training technique. We present a proof that the standard MMI training technique is valid for a very general class of acoustic models with any kind of parameter tying. We report …
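The MMI criterion maximizes the posterior probability of the reference transcription against all competing word sequences. A standard form of the objective (the acoustic scale $\kappa$ is a common practical detail, not mentioned in the abstract):

```latex
\mathcal{F}_{\mathrm{MMI}}(\theta)
  \;=\; \sum_{u} \log
    \frac{p_{\theta}(X_u \mid W_u)^{\kappa}\, P(W_u)}
         {\sum_{W} p_{\theta}(X_u \mid W)^{\kappa}\, P(W)},
```

where $X_u$ and $W_u$ are the acoustics and reference transcription of utterance $u$, and the denominator sums over competing hypotheses (in practice approximated by a lattice).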
This paper applies the recently proposed Extended Maximum Likelihood Linear Transformation (EMLLT) model in a Speaker Adaptive Training (SAT) context on the Switchboard database. Adaptation is carried out with maximum likelihood estimation of linear transforms for the means, precisions (inverse covariances) and the feature-space under the EMLLT model. This …
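In the EMLLT model referenced here, each Gaussian's precision matrix is a weighted sum of rank-one matrices built from a shared set of basis directions. A sketch of the precision structure (notation assumed):

```latex
\Sigma_j^{-1} \;=\; \sum_{d=1}^{D} \lambda_{jd}\, a_d a_d^{\top},
\qquad n \;\le\; D \;\le\; \tfrac{n(n+1)}{2},
```

where the directions $a_d$ are shared across all Gaussians, the weights $\lambda_{jd}$ are per-Gaussian, and $D$ interpolates between the diagonal-in-a-transformed-basis case ($D = n$, i.e. MLLT) and the full-covariance case.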
Minimum Bayes Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a posteriori probability (MAP) decoders in the context of N-best list rescoring and A* search over recognition lattices. Segmental MBR (SMBR) procedures have been developed to simplify implementation of MBR recognizers, by segmenting the N-best …
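The distinction the abstract draws can be stated in one line: the MBR decoder picks the hypothesis with least expected loss under the posterior, rather than the most probable hypothesis:

```latex
\hat{W}_{\mathrm{MBR}}
  \;=\; \arg\min_{W'} \sum_{W} L(W, W')\, P(W \mid X),
\qquad
\hat{W}_{\mathrm{MAP}}
  \;=\; \arg\max_{W} P(W \mid X),
```

where $L(W, W')$ is the task loss (typically word error). When $L$ is the 0/1 loss the two decoders coincide; SMBR makes the minimization tractable by applying it independently over segments.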
This paper applies the recently proposed SPAM models for acoustic modeling in a Speaker Adaptive Training (SAT) context on large vocabulary conversational speech databases, including the Switchboard database. SPAM models are Gaussian mixture models in which a subspace constraint is placed on the precision and mean matrices (although this paper focuses on …
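The subspace constraint that defines SPAM models can be sketched for the precision side: each Gaussian's precision is a weighted combination of a small set of globally shared symmetric matrices (notation assumed):

```latex
P_j \;=\; \Sigma_j^{-1} \;=\; \sum_{k=1}^{K} \lambda_{jk}\, S_k,
```

where the $S_k$ are shared across the whole mixture and only the weights $\lambda_{jk}$ are per-Gaussian. EMLLT is the special case in which every $S_k$ is rank one, so SPAM trades a harder estimation problem for a strictly more general covariance model.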