In this paper, we propose a model based on Dynamic Bayesian Networks (DBNs) to integrate information from multiple audio and visual streams. We also compare the DBN-based system (implemented using the Graphical Model Toolkit, GMTK) with a classical HMM (implemented in the Hidden Markov Model Toolkit, HTK) for both single-stream and two-stream integration.
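For context, the core of such two-stream combination can be sketched as a per-frame, per-state weighted fusion of log-likelihoods, a standard baseline in multi-stream HMM/DBN systems. The function and variable names below are illustrative only and are not taken from GMTK or HTK.

import numpy as np

def combine_stream_scores(audio_loglik, visual_loglik, audio_weight=0.7):
    # Weighted log-likelihood fusion of two observation streams.
    # The stream weights (exponents on the likelihoods) are normally
    # tuned on held-out data; 0.7 here is an arbitrary placeholder.
    visual_weight = 1.0 - audio_weight
    return audio_weight * audio_loglik + visual_weight * visual_loglik

# Toy usage: per-frame, per-state scores for a 3-state model over 5 frames.
rng = np.random.default_rng(0)
audio = rng.normal(-10.0, 1.0, size=(5, 3))   # stands in for log p(audio_t | state)
visual = rng.normal(-12.0, 1.5, size=(5, 3))  # stands in for log p(visual_t | state)
fused = combine_stream_scores(audio, visual)
print(fused.shape)  # (5, 3): fused scores would feed Viterbi decoding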
In recent years, features derived from the posteriors of a multilayer perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. We recently showed on a relatively small data set that MLPs trained for articulatory feature classification can be used in the same way.
The so-called tandem approach, in which the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification and have appended the posterior features to some standard feature set.
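As a concrete illustration of the tandem recipe described in the two abstracts above, the following minimal sketch trains an MLP phone classifier and appends its log-domain, PCA-reduced posteriors to the base features. All data here is synthetic, and the specific choices (100 hidden units, 8 PCA components) are typical defaults assumed for illustration, not values from the papers.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_frames, n_dims, n_phones = 1000, 39, 10

# Stand-ins for real data: acoustic feature frames and their phone labels.
base_feats = rng.normal(size=(n_frames, n_dims))
phone_labels = rng.integers(0, n_phones, size=n_frames)

# 1. Train an MLP for phone classification on the labeled frames.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=50, random_state=0)
mlp.fit(base_feats, phone_labels)

# 2. Take log posteriors (better matched to Gaussian models), decorrelate
#    and reduce them with PCA, and append them to the standard features.
log_post = np.log(mlp.predict_proba(base_feats) + 1e-10)
reduced = PCA(n_components=8).fit_transform(log_post)
tandem = np.hstack([base_feats, reduced])  # input to a conventional GMM-HMM
print(tandem.shape)  # (1000, 47)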
We address the problem of subselecting a large set of acoustic data to train automatic speech recognition (ASR) systems. To this end, we apply a novel data selection technique based on constrained submodular function maximization. Though NP-hard, the combinatorial optimization problem can be approximately solved by a simple and scalable greedy algorithm.
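To make the greedy algorithm concrete, here is a sketch using the facility-location function, a standard monotone submodular objective for data subset selection; the abstract does not specify the objective or the constraint, so the cardinality budget and nonnegative similarity matrix below are assumptions. For this setting, the classic greedy algorithm of Nemhauser, Wolsey, and Fisher achieves a (1 - 1/e) approximation.

import numpy as np

def greedy_facility_location(similarity, budget):
    # Greedily maximize f(S) = sum_i max_{j in S} similarity[i, j]
    # subject to |S| <= budget. Requires similarity[i, j] >= 0.
    n = similarity.shape[0]
    selected = []
    best_cover = np.zeros(n)  # max_{j in S} similarity[i, j] for each i
    for _ in range(budget):
        # Marginal gain of adding each candidate j: f(S + j) - f(S).
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-select an item
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy usage: pick 3 "representative" utterances out of 20.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 5))
sim = feats @ feats.T
sim -= sim.min()  # facility location assumes nonnegative similarities
print(greedy_facility_location(sim, budget=3))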
As is well known, exact inference in probabilistic graphical models requires a triangulated graph. Different triangulations can make exponential differences in complexity, but since finding the optimum is intractable, a wide variety of heuristics have been proposed, most involving a vertex elimination ordering. Elimination always yields a triangulated graph.
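To illustrate elimination, the sketch below triangulates a graph with the min-fill heuristic: repeatedly eliminate the vertex whose remaining neighbors need the fewest new edges to become a clique, adding those fill-in edges along the way. The adjacency-dictionary representation is an assumption made for the example.

import itertools

def triangulate_min_fill(adj):
    # adj maps each vertex to a set of neighbors (undirected graph).
    # Returns an elimination ordering and the fill-in edges whose
    # addition makes the original graph triangulated.
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    order, fill = [], set()

    def fill_count(v):
        nbrs = adj[v] & remaining
        return sum(1 for a, b in itertools.combinations(nbrs, 2)
                   if b not in adj[a])

    while remaining:
        v = min(remaining, key=fill_count)  # fewest fill-ins first
        order.append(v)
        nbrs = adj[v] & remaining
        for a, b in itertools.combinations(nbrs, 2):
            if b not in adj[a]:  # connect v's neighbors into a clique
                adj[a].add(b)
                adj[b].add(a)
                fill.add(frozenset((a, b)))
        remaining.remove(v)
    return order, fill

# Toy usage: a 4-cycle a-b-c-d needs exactly one chord to be triangulated.
graph = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
order, fill = triangulate_min_fill(graph)
print(order, [tuple(e) for e in fill])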
In this work we combine a conventional phone-based automatic speech recognizer with a classifier that detects syllable locations. This is done using a dynamic Bayesian network. Using oracle syllable detections we achieve a 17% relative reduction in word error rate on the 500-word task of the SVitchboard corpus. Using estimated locations we achieve a 2.1% relative reduction.
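Since the abstract reports only relative reductions, this small snippet shows how such a figure maps to an absolute word error rate; the 50% baseline WER is purely hypothetical, chosen for illustration.

def wer_after_relative_reduction(baseline_wer, relative_reduction):
    # E.g., a 17% relative reduction from a 50.0% baseline
    # gives 50.0 * (1 - 0.17) = 41.5% absolute WER.
    return baseline_wer * (1.0 - relative_reduction)

print(wer_after_relative_reduction(50.0, 0.17))   # oracle detections -> 41.5
print(wer_after_relative_reduction(50.0, 0.021))  # estimated detections -> 48.95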
We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector.
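In hybrid HMM/neural network systems, classifier posteriors are conventionally turned into scaled likelihoods by dividing by the class priors; the sketch below shows that conversion, with toy data standing in for real articulatory feature classifier outputs. The function name and class inventory are illustrative assumptions.

import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    # Hybrid trick: p(x_t | class) is proportional to
    # p(class | x_t) / p(class), so the log-score below can replace
    # GMM observation log-likelihoods inside an HMM or DBN.
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)

# Toy usage: 5 frames of posteriors over 3 hypothetical voicing classes.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=5)       # each row sums to 1
priors = np.array([0.5, 0.3, 0.2])             # class priors from training data
print(scaled_log_likelihoods(post, priors).shape)  # (5, 3) scores for decoding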