Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study
TLDR
This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors.
  • 98
  • 8
Discovery and fusion of salient multimodal features toward news story segmentation
TLDR
In this paper, we present our new results in news video story segmentation and classification in the context of the TRECVID 2003 benchmarking event.
  • 73
  • 7
A cascade image transform for speaker independent automatic speechreading
TLDR
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in improved recognition performance of spoken words or phonemes.
  • 65
  • 7
Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues
TLDR
We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features.
  • 161
  • 5
A Cascade Visual Front End for Speaker Independent Automatic Speechreading
TLDR
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes.
  • 45
  • 4
IBM Research TREC 2002 Video Retrieval System
TLDR
In this paper, we describe the IBM Research system for analysis, indexing, and retrieval of video, which was applied to the TREC-2002 video retrieval benchmark.
  • 70
  • 4
Discriminative model fusion for semantic concept detection and annotation in video
TLDR
We describe a general information fusion algorithm that can be used to incorporate multimodal cues in building user-defined semantic concept models.
  • 86
  • 3
Audio-visual synchrony for detection of monologues in video archives
TLDR
In this paper, we present our approach to detecting monologues in video shots.
  • 44
  • 3
Assessing face and speech consistency for monologue detection in video
TLDR
This paper considers schemes for determining which of a set of faces on screen, if any, is producing speech in a video soundtrack.
  • 55
  • 3
Joint visual-text modeling for automatic retrieval of multimedia documents
TLDR
We propose a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR).
  • 47
  • 2