When trying to overcome the significant performance drops of ASR systems in the presence of noise, one road to follow is the integration of the information present in the lip movements of the speaker. Comparisons showed that integration of audio and video data at the decision level yields the best recognition results. This raises the question of how to weight the...
Encouraged by the good performance of the DCT in audiovisual speech recognition [1], we investigate how the selection of the DCT coefficients influences the recognition scores in a hybrid ANN/HMM audiovisual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers. Three sets of coefficients, based on the mean energy,...
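As a rough illustration of what ranking coefficients by mean energy could look like, the sketch below averages the squared value of each DCT coefficient over a set of frames and keeps the indices with the highest averages. The function name `select_by_mean_energy`, the flat per-frame coefficient layout, and the toy numbers are assumptions made for this example; the abstract does not specify how the selection was actually implemented.

```python
def select_by_mean_energy(frames, k):
    """Return the indices of the k coefficients with the highest mean energy.

    frames: list of equally long lists, one list of DCT coefficients per
            video frame (hypothetical layout, assumed for this sketch).
    k:      number of coefficients to keep.
    """
    n_coeffs = len(frames[0])
    # Mean energy of coefficient i = average of its squared value over frames.
    mean_energy = [
        sum(frame[i] ** 2 for frame in frames) / len(frames)
        for i in range(n_coeffs)
    ]
    # Rank coefficient indices from highest to lowest mean energy.
    ranked = sorted(range(n_coeffs), key=lambda i: mean_energy[i], reverse=True)
    # Return the selected indices in ascending order for stable feature layout.
    return sorted(ranked[:k])


# Toy data: two frames with four coefficients each (made-up values).
toy_frames = [[3.0, 0.1, -2.0, 0.0],
              [-3.0, 0.2, 1.0, 0.5]]
selected = select_by_mean_energy(toy_frames, 2)
```

Because the ranking uses squared values, coefficients that swing between large positive and negative values still count as high energy, which is why index 0 is selected in the toy data even though its mean value is zero.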
In this paper we present a hierarchical framework for the extraction of...
In this paper we present a system for audiovisual speech recognition based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) approach. To set up the system it was necessary to record a new audiovisual database. We will describe the recording and labeling of the database. The fusion of audio and video data is a key aspect of the paper.
We present a sound localization system that operates in real time, calculates three binaural cues (IED, IID, and ITD), and integrates them in a biologically inspired fashion into a combined localization estimate. Position information is furthermore integrated over frequency channels and time. The localization system controls a head motor to fovealize on and...
In this paper we propose an algorithm for the robust extraction of pitch that combines temporal (rate) and pattern matching (place) techniques. Following a transformation into the spectral domain via a Gammatone filter bank, the rate information is extracted in each band via the zero crossing distances in that band. Next, a comb filter...
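The rate step can be pictured with a minimal sketch for a single band: locate the upward zero crossings of the band signal, measure the distances between successive crossings, and invert a typical distance to get a frequency. Everything here is an assumption for illustration (the function names, the linear interpolation of crossing times, the use of the median, and the synthetic sine standing in for one Gammatone band output); the abstract does not describe the actual procedure at this level of detail.

```python
import math
from statistics import median


def upward_zero_crossings(signal, sample_rate):
    """Return the times (in seconds) of negative-to-positive zero crossings."""
    times = []
    for n in range(1, len(signal)):
        prev, cur = signal[n - 1], signal[n]
        if prev < 0.0 <= cur:
            # Linearly interpolate between the two samples to refine the
            # crossing position to a fraction of a sample.
            frac = -prev / (cur - prev)
            times.append((n - 1 + frac) / sample_rate)
    return times


def estimate_frequency(signal, sample_rate):
    """Estimate the dominant frequency from zero crossing distances.

    Returns None if fewer than two crossings are found.
    """
    times = upward_zero_crossings(signal, sample_rate)
    if len(times) < 2:
        return None
    distances = [b - a for a, b in zip(times, times[1:])]
    # The median is used here to reduce the influence of outlier distances.
    return 1.0 / median(distances)


# Synthetic stand-in for one band signal: a 100 Hz sine sampled at 8 kHz.
SAMPLE_RATE = 8000
band = [math.sin(2.0 * math.pi * 100.0 * n / SAMPLE_RATE + 0.3)
        for n in range(800)]
f_est = estimate_frequency(band, SAMPLE_RATE)
```

In a filter-bank setting, such an estimate would be computed per band; how the per-band estimates are then combined is not covered by this sketch.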
We investigate the fusion of audio and video a posteriori phonetic probabilities in a hybrid ANN/HMM audiovisual speech recognition system. Three basic conditions on the fusion process are stated and implemented in a linear and a geometric weighting scheme. These conditions are the assumption of conditional independence of the audio and video data and the...
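To make the two kinds of weighting concrete, the sketch below fuses an audio and a video posterior distribution over the same phoneme classes, once as a weighted arithmetic mean (linear) and once as a weighted geometric mean (computed in the log domain and renormalised). These are generic textbook forms, not necessarily the schemes derived in the paper; the function names, the single scalar `weight`, the `eps` guard, and the example probabilities are all assumptions for illustration.

```python
import math


def fuse_linear(p_audio, p_video, weight):
    """Weighted arithmetic mean of two posterior distributions.

    weight = 1.0 trusts audio only, weight = 0.0 trusts video only.
    """
    fused = [weight * a + (1.0 - weight) * v
             for a, v in zip(p_audio, p_video)]
    total = sum(fused)
    return [f / total for f in fused]


def fuse_geometric(p_audio, p_video, weight, eps=1e-12):
    """Weighted geometric mean of two posteriors, renormalised to sum to 1.

    eps guards against log(0); it is an implementation detail of this sketch.
    """
    log_fused = [weight * math.log(a + eps) + (1.0 - weight) * math.log(v + eps)
                 for a, v in zip(p_audio, p_video)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_fused)
    unnorm = [math.exp(x - peak) for x in log_fused]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Made-up posteriors over three phoneme classes.
p_a = [0.7, 0.2, 0.1]
p_v = [0.2, 0.5, 0.3]
lin = fuse_linear(p_a, p_v, 0.5)
geo = fuse_geometric(p_a, p_v, 0.5)
```

One practical difference between the two forms: in the geometric scheme a class that either stream rates as nearly impossible stays nearly impossible after fusion, whereas the linear scheme lets the other stream keep it alive.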
This paper investigates the audiovisual correlates and the detection of word prominence. Subjects interacted with a computer in a small game that created a broad and a narrow focus condition. Audiovisual recordings were made with a distant microphone and without visual markers. As acoustic features duration, intensity, fundamental frequency and...
We present a framework for estimating formant trajectories. Its focus is to achieve high robustness in noisy environments. Our approach combines a preprocessing stage based on functional principles of the human auditory system with a probabilistic tracking scheme. To enhance the formant structure in spectrograms we use a Gammatone filterbank, a spectral...