Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos

Abstract

Videos are rich in multimedia content and semantics, which video browsers should use to better present audio-visual information to the viewer. Ubiquitous video players allow content to be scanned only linearly, rarely providing summaries or methods for searching. Through analysis of the audio and video tracks, it is possible to extract text transcripts from audio, displayed text from video, and higher-level semantics through speaker identification and scene analysis. External data sources, when available, can be used to cross-reference the video content and impose an organizational structure. Various research tools have addressed video summarization and browsing using one or more of these modalities; however, most of them assume edited videos as input. We focus our research on genres of personal interaction videos, and on collections of such videos in their unedited form. We present and verify formal models for their structure, and develop methods for their automatic analysis, summarization, and indexing. We specify the characteristic semantic components of three related genres of candidly captured videos: formal instructions or lectures, student team project presentations, and discussions. For each genre, we design and validate a separate multi-modal approach to segmenting and structuring its content. We develop novel user interfaces to support browsing and searching the multi-modal video information, and introduce the tool in a classroom environment with ≈160 students per semester. UI elements are designed according to the underlying video structure to support video browsing in a structured multi-modal space. These user interfaces include image/video browsers, audio/video segmentation browsers, and text/filtered ASR transcript browsers. Through several user studies, we evaluate and refine our indexing methods, browser interface, and the tool's usefulness in the classroom.
We propose a core/module methodology for the analysis, structuring, and visualization of personal interaction videos. Analysis, structuring, and visualization techniques in the core are common to all genres. Modular features are characteristic of particular video genres and are applied selectively. The structure of interactions in each video is derived from the combination of the resulting audio, visual, and textual features. We expect that the framework can be applied to genres not covered here with the addition or replacement of a few characteristic modules.


Cite this paper

@inproceedings{Haubold2006SemanticMA,
  title={Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos},
  author={Alexander Haubold},
  year={2006}
}