Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention


Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided.

Manuscript received November 16, 2011; revised July 25, 2012 and October 22, 2012; accepted January 07, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Christophe De Vleeschouwer. Copyright © 2012 IEEE. Personal use of this material is permitted; permission to use this material for any other purposes must be obtained from the IEEE.

G. Evangelopoulos, A. Zlatintsi, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis are with the School of Electrical and Computer Engineering, National Technical University of Athens, Athens GR-15773, Greece. A. Potamianos is with the Department of Electronics and Computer Engineering, Technical University of Crete, Chania GR-73100, Greece.

This research was partially supported by: (1) the project “COGNIMUSE”, implemented under the “ARISTEIA” Action of the Operational Program “Education and Lifelong Learning” and co-funded by the European Social Fund (ESF) and National Resources; (2) the European Union (ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework, Research Funding Program: Heracleitus II; (3) the EU project DIRHA, grant FP7-ICT-2011-7-288121.
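The abstract's pipeline — per-modality saliency cues normalized and fused into a single time-varying curve — can be illustrated with a minimal sketch. The sketch below assumes the aural cue is a Teager-Kaiser energy track (a standard nonlinear energy operator consistent with the "nonlinear operators and energy tracking" description) and that fusion is a weighted linear combination of min-max-normalized streams; the function names, weights, and the linear fusion rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def teager_energy(x):
    """Teager-Kaiser energy operator: Psi[n] = x[n]^2 - x[n-1]*x[n+1].

    A nonlinear operator that tracks the instantaneous energy of
    amplitude/frequency modulations in a waveform.
    """
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate boundary samples
    return np.abs(psi)

def minmax_normalize(s):
    """Scale a saliency stream to [0, 1] so modalities are comparable."""
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def fuse_saliency(aural, visual, textual, weights=(1/3, 1/3, 1/3)):
    """Weighted linear fusion of normalized per-modality saliency streams
    into one multimodal saliency curve (weights are an assumption here)."""
    streams = [minmax_normalize(s) for s in (aural, visual, textual)]
    return sum(w * s for w, s in zip(weights, streams))
```

Segments where the fused curve peaks would then be the candidate "attention-invoking" events that a bottom-up summarizer keeps.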

DOI: 10.1109/TMM.2013.2267205


Cite this paper

@article{Evangelopoulos2013MultimodalSA,
  title   = {Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention},
  author  = {Georgios Evangelopoulos and Athanasia Zlatintsi and Alexandros Potamianos and Petros Maragos and Konstantinos Rapantzikos and Georgios Skoumas and Yannis S. Avrithis},
  journal = {IEEE Trans. Multimedia},
  year    = {2013},
  volume  = {15},
  pages   = {1553--1568}
}