Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization


In this paper, we present an improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio, and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the resulting saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words from the movie subtitles. To evaluate the proposed system, we employ a simple, non-parametric classification technique, K-Nearest Neighbors (KNN). Detection results are reported on the MovSum database, using both objective evaluations against ground-truth annotations of perceptually salient events and human evaluations of the movie summaries. Our evaluation verifies the effectiveness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving comprehension of the movie's semantics.
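The Teager-Kaiser Energy Operator mentioned above for auditory saliency has a standard discrete form, Psi[x(n)] = x(n)^2 - x(n-1)x(n+1). The sketch below is only an illustration of that operator, not the paper's feature-extraction pipeline; the function name and example signal are ours:

```python
import numpy as np

def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser Energy Operator:
    Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
    Boundary samples are left at zero."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

# For a unit-amplitude discrete sinusoid sin(omega * n), the operator
# yields exactly sin^2(omega) at every interior sample, i.e. it tracks
# the signal's amplitude-frequency energy product.
fs = 16000                      # hypothetical sampling rate
n = np.arange(160)              # 10 ms of samples
tone = np.sin(2 * np.pi * 440 * n / fs)
energy = teager_kaiser_energy(tone)
```

For a constant-amplitude tone the interior output is constant, which is why frame-level statistics of this energy signal are a natural basis for auditory saliency features.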

DOI: 10.1109/ICIP.2015.7351630


Cite this paper

@article{Koutras2015PredictingAS,
  title   = {Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization},
  author  = {Petros Koutras and Athanasia Zlatintsi and Elias Iosif and Athanasios Katsamanis and Petros Maragos and Alexandros Potamianos},
  journal = {2015 IEEE International Conference on Image Processing (ICIP)},
  year    = {2015},
  pages   = {4361-4365}
}