Video event detection and summarization using audio, visual and text saliency


Detection of perceptually important video events is formulated here on the basis of saliency models for the audio, visual and textual information conveyed in a video stream. Audio saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color and motion. Text saliency is extracted by part-of-speech tagging of the subtitle information available with most movie distributions. The single-modality curves are integrated into one attention curve, in which the presence of an event may be signified in one or multiple domains. This multimodal saliency curve forms the basis of a bottom-up video summarization algorithm that refines results from unimodal or audiovisual-based skimming. The algorithm performs favorably for video summarization in terms of informativeness and enjoyability.
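The fusion and skim-selection steps outlined above can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the min-max normalization, the equal linear fusion weights, and the fixed skim ratio are all assumptions introduced here for clarity.

```python
import numpy as np

def normalize(curve):
    """Scale a per-frame saliency curve to [0, 1] (assumed normalization)."""
    c = np.asarray(curve, dtype=float)
    rng = c.max() - c.min()
    return (c - c.min()) / rng if rng > 0 else np.zeros_like(c)

def fuse_saliency(audio, visual, text, weights=(1/3, 1/3, 1/3)):
    """Weighted linear fusion of the three modality curves into one
    multimodal attention curve (weights are illustrative, not the paper's)."""
    wa, wv, wt = weights
    return wa * normalize(audio) + wv * normalize(visual) + wt * normalize(text)

def select_summary_frames(saliency, ratio=0.2):
    """Bottom-up skim: keep the top-`ratio` fraction of frames by fused
    saliency, returned in temporal order."""
    k = max(1, int(len(saliency) * ratio))
    idx = np.argsort(saliency)[-k:]   # indices of the k most salient frames
    return np.sort(idx)               # restore temporal order for the skim
```

A frame scoring high in any single modality (e.g. a sharp audio modulation with low visual motion) can still dominate the fused curve, which is how an event "signified in one or multiple domains" survives into the summary.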

DOI: 10.1109/ICASSP.2009.4960393

Cite this paper

@inproceedings{Evangelopoulos2009VideoED,
  title     = {Video event detection and summarization using audio, visual and text saliency},
  author    = {Georgios Evangelopoulos and Athanasia Zlatintsi and Georgios Skoumas and Konstantinos Rapantzikos and Alexandros Potamianos and Petros Maragos and Yannis S. Avrithis},
  booktitle = {2009 IEEE International Conference on Acoustics, Speech and Signal Processing},
  year      = {2009},
  pages     = {3553--3556}
}