Keywords: hidden Markov model, artificial neural network, tandem model, Gaussian mixture model supervector. Abstract: Acoustic Event Detection (AED) aims to identify both timestamps and types of events in an audio stream. This becomes very challenging when going beyond restricted highlight events and well-controlled recordings. We propose extracting discriminative…
In this work, we present a SIFT-Bag based generative-to-discriminative framework for addressing the problem of video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, the distribution of which is described by a Gaussian Mixture Model (GMM). In the discriminative stage, the…
Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This…
Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large-scale analysis of web videos. In our work, we rigorously analyze and combine a large set of…
Speech perceptual features, such as Mel-frequency Cepstral Coefficients (MFCC), have been widely used in acoustic event detection. However, the different spectral structures between speech and acoustic events degrade the performance of the speech feature sets. We propose quantifying the discriminative capability of each feature component according to the…
Because of the spectral difference between speech and acoustic events, we propose using Kullback-Leibler distance to quantify the discriminant capability of all speech feature components in acoustic event detection. Based on these distances, we use AdaBoost to select a discriminant feature set and demonstrate that this feature set outperforms classical…
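As a rough illustration of scoring feature components by Kullback-Leibler distance, the sketch below fits a single univariate Gaussian per component and per class and ranks components by symmetric KL divergence. The Gaussian model, the symmetric form, the variance floor, and the function names are all assumptions made for illustration; the abstract does not specify how the distance is estimated, and the AdaBoost selection step is not reproduced here.

```python
import math
from statistics import fmean, pvariance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two univariate Gaussians (closed form)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetric KL distance: KL(p || q) + KL(q || p)."""
    return (gaussian_kl(mu_p, var_p, mu_q, var_q)
            + gaussian_kl(mu_q, var_q, mu_p, var_p))


def rank_components(frames_a, frames_b, var_floor=1e-6):
    """Rank feature components by how well they separate two classes.

    frames_a, frames_b: lists of feature vectors (one list of floats per
    frame), e.g. frames of two different acoustic event classes.
    Returns (score, component_index) pairs, most discriminative first.
    """
    scores = []
    for d in range(len(frames_a[0])):
        col_a = [frame[d] for frame in frames_a]
        col_b = [frame[d] for frame in frames_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        # Floor the variances so a constant component cannot divide by zero.
        var_a = max(pvariance(col_a, mu_a), var_floor)
        var_b = max(pvariance(col_b, mu_b), var_floor)
        scores.append((symmetric_kl(mu_a, var_a, mu_b, var_b), d))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Toy data: component 0 is identical across classes, component 1 differs.
    class_a = [[0.0, 0.0], [1.0, 1.0]]
    class_b = [[0.0, 10.0], [1.0, 11.0]]
    for score, index in rank_components(class_a, class_b):
        print(f"component {index}: symmetric KL = {score:.3f}")
```

A component whose distributions coincide across the two classes scores zero, while a component whose class means are far apart relative to its variance scores high, which is the sense in which the distance reflects discriminant capability.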
High-quality speech-to-lips conversion, investigated in this work, renders realistic lip movement (video) consistent with input speech (audio) without knowing its linguistic content. Instead of memoryless frame-based conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audiovisual joint Gaussian Mixture Model…
We present a system that detects human falls in the home environment, distinguishing them from competing noise, by using only the audio signal from a single far-field microphone. The proposed system models each fall or noise segment by means of a Gaussian mixture model (GMM) supervector, whose Euclidean distance measures the pairwise difference between…
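To make the supervector idea concrete, the following minimal sketch stacks the component means of a per-segment GMM into one long vector and compares two segments by Euclidean distance. It assumes the per-segment component means are already available (for example, from adapting a shared background model) and uses plain, unnormalized means; the function names are hypothetical, and the actual system may scale the means by component weights or covariances.

```python
import math


def supervector(component_means):
    """Concatenate the mean vectors of all GMM components into one vector."""
    return [value for mean in component_means for value in mean]


def supervector_distance(means_a, means_b):
    """Euclidean distance between the supervectors of two audio segments.

    Both segments must use the same number of components, in the same
    order, with the same feature dimensionality.
    """
    return math.dist(supervector(means_a), supervector(means_b))


if __name__ == "__main__":
    # Two toy segments, each modeled by a 2-component GMM over 2-D features.
    segment_a = [[0.0, 0.0], [1.0, 1.0]]
    segment_b = [[3.0, 4.0], [1.0, 1.0]]
    print(supervector_distance(segment_a, segment_b))
```

Because every segment is mapped to a vector of the same fixed length regardless of its duration, pairwise distances like this one can feed a standard distance-based or kernel classifier.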
Recent studies in patch-based Gaussian Mixture Model (GMM) approaches for face age estimation present promising results. We propose using a hidden Markov model (HMM) supervector to represent face image patches, to improve on the previous GMM supervector approach by capturing the spatial structure of human faces and loosening the assumption of identical…
We describe the Raytheon BBN Technologies (BBN) led VISER system for the TRECVID 2012 Multimedia Event Detection (MED) and Recounting (MER) tasks. We present a comprehensive analysis of the different modules in our evaluation system that includes: (1) a large suite of visual, audio and multimodal low-level features, (2) modules to detect semantic…