Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal, and have fared poorly in …
In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a significant …
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given …
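The hybrid scheme contrasted above is commonly operationalized by converting the network's posteriors into scaled likelihoods that can replace GMM emission scores: dividing P(state | x) by the state prior P(state) yields P(x | state)/P(x). A minimal numerical sketch of that step, with all values invented for illustration:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Divide network posteriors P(q|x) by state priors P(q) to obtain
    scaled likelihoods P(x|q)/P(x), usable as HMM emission scores."""
    return posteriors / np.maximum(priors, eps)

# One toy frame over 3 subword states (values invented):
posteriors = np.array([0.7, 0.2, 0.1])  # network output P(q|x)
priors = np.array([0.5, 0.3, 0.2])      # priors counted from training alignments
scaled = posteriors_to_scaled_likelihoods(posteriors, priors)
# e.g. scaled[0] = 0.7 / 0.5 = 1.4
```

Since the P(x) factor is shared across states at a given frame, it does not affect which path Viterbi decoding prefers, which is why the scaled quantity is a valid stand-in for the emission likelihood.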
A room may be used for a wide variety of performances and presentations. Each use places different acoustical requirements on the room. We desire a method of electronically controlling the acoustical properties of a room so that one physical space can accommodate various uses. A virtual acoustic room is a room equipped with speakers, microphones, and signal …
This paper provides an overview of current state-of-the-art approaches for melody extraction from polyphonic audio recordings, and it proposes a methodology for the quantitative evaluation of melody extraction algorithms. We first define a general architecture for melody extraction systems and discuss the difficulties of the problem at hand; then, we review …
We investigate the challenging problem of joint audio-visual analysis of generic videos, targeting semantic concept detection. We propose to extract a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio …
This paper describes the SPRACH system developed for the 1998 Hub-4E broadcast news evaluation. The system is based on the connectionist-HMM framework and uses both recurrent neural network and multi-layer perceptron acoustic models. We describe both a system designed for the primary transcription hub, and a system for the less-than-10-times-real-time …
Semantic indexing of images and videos in the consumer domain has become an important issue for both research and practical application. In this work we developed Kodak's consumer video benchmark data set, which includes (1) a significant number of videos from actual users, (2) a rich lexicon that accommodates consumers' needs, and (3) the annotation of a …
One approach to speaker adaptation for the neural-network acoustic models of a hybrid connectionist-HMM speech recognizer is to adapt a speaker-independent network by performing a small amount of additional training using data from the target speaker, giving an acoustic model specifically tuned to that speaker. This adapted model might be useful for speaker …
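The adaptation recipe described above amounts to a few extra gradient steps on target-speaker frames, starting from the speaker-independent weights. A hedged sketch with a toy linear softmax layer standing in for the network (all data, shapes, and hyperparameters invented for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapt(W, X, y, lr=0.1, steps=5):
    """A few steps of cross-entropy SGD on target-speaker frames X
    (frames x dims) with integer state labels y; returns adapted weights."""
    W = W.copy()
    for _ in range(steps):
        p = softmax(X @ W)               # per-frame posteriors
        p[np.arange(len(y)), y] -= 1.0   # gradient of cross-entropy w.r.t. logits
        W -= lr * X.T @ p / len(y)       # averaged gradient step
    return W

rng = np.random.default_rng(0)
W_si = rng.standard_normal((4, 3)) * 0.01   # "speaker-independent" weights
X_spk = rng.standard_normal((20, 4))        # target-speaker frames (invented)
y_spk = rng.integers(0, 3, size=20)         # labels from forced alignment
W_adapted = adapt(W_si, X_spk, y_spk)
```

In practice the same idea is applied to the full network (often only to selected layers, with a small learning rate) so that the model stays close to the speaker-independent starting point while fitting the limited adaptation data.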
The growth of widespread use of the World-Wide Web (WWW) in recent years has spurred researchers in human–computer interaction to focus on usability issues regarding hypermedia (Cockburn & Jones, 1996). Important usability issues arise when one considers how people will navigate through this immense pool of hypermedia-based information. To help investigate …