Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal and have fared poorly in…
In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a significant…
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual sub-word units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units…
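The GMM emission model mentioned in the first sentence can be sketched minimally. The function below is illustrative only, assuming diagonal covariances (as is typical when features are decorrelated); it computes the log-likelihood of one feature vector under one HMM state's mixture.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture, as used for HMM state emission probabilities.
    weights, means, variances are per-component parameters (illustrative
    names, not from any particular toolkit)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    log_comps = []
    for w, mu, var in zip(weights, means, variances):
        # Diagonal Gaussian log-density for this component
        log_det = np.sum(np.log(var))
        maha = np.sum((x - mu) ** 2 / var)
        log_n = -0.5 * (d * np.log(2 * np.pi) + log_det + maha)
        log_comps.append(np.log(w) + log_n)
    # Log-sum-exp over mixture components for numerical stability
    m = max(log_comps)
    return m + np.log(sum(np.exp(c - m) for c in log_comps))
```

In a full recognizer this value would feed the Viterbi decoder as the state's acoustic score; the hybrid connectionist systems described next replace it with a scaled neural-network posterior.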
A beat-synchronous chroma representation enables the matching of cover versions of popular music using global cross-correlation across time- and transposition-skew.
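The matching step can be sketched as follows: given two beat-synchronous chroma matrices (12 pitch classes by n beats), cross-correlate them over all time lags and all 12 circular chroma rotations, and take the largest peak as the match score. The normalization below is an assumption for illustration, not the paper's exact scoring.

```python
import numpy as np

def cover_match_score(chroma_a, chroma_b):
    """Best normalized cross-correlation peak between two beat-synchronous
    chroma matrices (shape 12 x n_beats), searched over all time lags and
    all 12 transpositions. Illustrative sketch, not the published system."""
    best = -np.inf
    norm = np.linalg.norm(chroma_a) * np.linalg.norm(chroma_b)
    for shift in range(12):                       # transposition skew
        b_rot = np.roll(chroma_b, shift, axis=0)  # rotate pitch classes
        # Sum per-bin cross-correlations -> one function of time lag
        xc = sum(np.correlate(chroma_a[p], b_rot[p], mode="full")
                 for p in range(12))
        best = max(best, xc.max() / norm)
    return best
```

A track compared against itself (or a transposed copy of itself) scores 1.0 at the aligned lag; genuinely different tracks score lower by the Cauchy-Schwarz inequality.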
This paper provides an overview of current state-of-the-art approaches for melody extraction from polyphonic audio recordings, and it proposes a methodology for the quantitative evaluation of melody extraction algorithms. We first define a general architecture for melody extraction systems and discuss the difficulties of the problem at hand; then, we review…
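One metric commonly used in quantitative evaluations of this kind is raw pitch accuracy: the fraction of reference-voiced frames whose estimated pitch falls within a tolerance (typically 50 cents) of the reference. The sketch below assumes frame-aligned reference and estimate tracks in Hz, with 0 marking unvoiced frames; it is a generic illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def raw_pitch_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Fraction of reference-voiced frames (ref > 0) whose estimated
    pitch lies within tol_cents of the reference frequency."""
    ref = np.asarray(ref_hz, dtype=float)
    est = np.asarray(est_hz, dtype=float)
    voiced = ref > 0
    if not voiced.any():
        return 0.0
    # Pitch error in cents: 1200 * |log2(est / ref)|
    cents = 1200.0 * np.abs(np.log2(np.maximum(est[voiced], 1e-9) / ref[voiced]))
    return float(np.mean(cents <= tol_cents))
```

Companion metrics in such evaluations typically score voicing detection separately, so that pitch accuracy and voiced/unvoiced decisions can be analyzed independently.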
We investigate the challenging issue of joint audiovisual analysis of generic videos, targeting semantic concept detection. We propose to extract a novel representation, the Short-term AudioVisual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio…
The development of reliable measures of confidence for the decoding of speech sounds by machine has the potential to greatly enhance the state of the art in automatic speech recognition (ASR). This dissertation describes the derivation of several complementary confidence measures from a so-called acceptor hidden Markov model (HMM) based large…
Building machines that emulate the kinds of acoustic information processing that human beings take for granted has proved unexpectedly difficult; the human auditory system is extremely sophisticated in its adaptation to the sounds of the real world, and uses an impressive array of features as cues to organization and interpretation. As more of these cues…
A neural net classifier is trained to identify the pitch of a frame of subband autocorrelation principal components. Accuracy is greatly improved for noisy, bandlimited speech, matched to the training data.
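The feature extraction feeding that classifier can be sketched as: compute a short-time autocorrelation within each auditory subband, then project each band's autocorrelation onto a small set of principal components to get a compact input vector. All names here are illustrative, and the PCA bases are assumed to have been learned offline; this is a sketch of the idea, not the published system's code.

```python
import numpy as np

def subband_autocorr_features(subband_frames, pca_bases, n_lags=200):
    """Per-subband short-time autocorrelation projected onto per-band PCA
    bases -- the kind of compact feature a pitch classifier can consume.
    subband_frames: list of 1-D signal frames, one per subband.
    pca_bases: list of (n_components x n_lags) matrices (assumed given)."""
    feats = []
    for band, basis in zip(subband_frames, pca_bases):
        # One-sided autocorrelation, normalized by the zero-lag value
        ac = np.correlate(band, band, mode="full")[len(band) - 1:]
        ac = ac[:n_lags] / (ac[0] + 1e-9)
        feats.append(basis @ ac)   # project onto principal components
    return np.concatenate(feats)
```

The concatenated projections across all subbands form the frame-level input to the pitch classifier; restricting each band to a few components keeps the network small while preserving the periodicity cues.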