Visual Semantic Role Labeling for Video Understanding

@inproceedings{sadhu2021visual,
  title={Visual Semantic Role Labeling for Video Understanding},
  author={Arka Sadhu and Tanmay Gupta and Mark Yatskar and Ramakant Nevatia and Aniruddha Kembhavi},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos, or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated…
Joint Multimedia Event Extraction from Video and Article
This work introduces the new task of Video MultiMedia Event Extraction (VME), proposes the first self-supervised multimodal event coreference model that can determine coreference between video events and text events without any manually annotated pairs, and introduces the first multimodal transformer that extracts structured event information jointly from both videos and text documents.
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.
Dense-Captioning Events in Videos
This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.
Visual Semantic Role Labeling
The problem of Visual Semantic Role Labeling is introduced: given an image, the authors want to detect people performing actions, localize the objects of interaction, and associate objects in the scene with different semantic roles for each action.
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  C. Gu, Chen Sun, +8 authors, J. Malik. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
MovieNet: A Holistic Dataset for Movie Understanding
MovieNet is the largest dataset with the richest annotations for comprehensive movie understanding, and it is believed that such a holistic dataset will promote research on story-based long video understanding and beyond.
Knowledge Graph Extraction from Videos
This paper proposes the new task of knowledge graph extraction from videos, i.e., producing a description of a given video's contents in the form of a knowledge graph, and includes a method to generate such graphs automatically, starting from datasets where videos are annotated with natural language.
Grounding Semantic Roles in Images
This work renders candidate participants as image regions of objects, and trains a model that learns to ground roles in the regions depicting the corresponding participants, inducing frame-semantic visual representations.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summary of different state-of-the-art video-to-text approaches, shows that a hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Localizing Moments in Video with Natural Language
The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.
A Dataset for Movie Description
Comparing audio descriptions (ADs) to scripts, it is found that ADs are far more visual and describe precisely what is shown, rather than what should happen according to scripts created prior to movie production.