Visual Semantic Role Labeling for Video Understanding

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ramakant Nevatia, Aniruddha Kembhavi. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos, or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated…
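The verb-plus-roles representation described above can be sketched as a simple data structure. This is a hypothetical illustration only: the class and field names (`Event`, `roles`, `clip_id`, the "Arg0"/"Arg1" role labels) are assumptions for clarity, not the VidSitu benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One salient event in a clip: a verb plus role-to-entity mappings."""
    verb: str
    roles: dict  # semantic role -> entity description, e.g. "Arg0" -> "man"

@dataclass
class VideoAnnotation:
    """A clip represented as a set of related events."""
    clip_id: str
    events: list = field(default_factory=list)

# Hypothetical example: one annotated event in a 10-second clip
ann = VideoAnnotation(clip_id="clip_0001")
ann.events.append(Event(verb="throw", roles={"Arg0": "man", "Arg1": "ball"}))
```

Representing each event this way makes the "related events" framing concrete: relations between events in the same clip can then be expressed as edges between `Event` objects.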

Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World

The novel task of MultiModal Event-Event Relations (M2E2R) is proposed to recognize cross-modal event relations, and the proposed MERP (Multimodal Event Relations Predictor) is trained on pseudo labels, while also leveraging commonsense knowledge from an external Knowledge Base, and evaluated.

Joint Multimedia Event Extraction from Video and Article

This work introduces the new task of Video MultiMedia Event Extraction (VME) and proposes the first self-supervised multimodal event coreference model, which can determine coreference between video events and text events without any manually annotated pairs. It also introduces the first multimodal transformer that extracts structured event information jointly from both videos and text documents.

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations, and reveals the impact of temporal granularity with three major findings.

Hierarchical Self-supervised Representation Learning for Movie Understanding

This paper proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of a hierarchical movie understanding model (based on [37]), and demonstrates the effectiveness of contextualized event features on LVU tasks.

Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding

A video-language story dataset, Synopses of Movie Narratives (SYMON), containing 5,193 video summaries of popular movies and TV series, which has higher story coverage and more frequent mental-state references than similar video-language story datasets.

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

A novel two-stage framework is proposed that exploits the bidirectional relations between verbs and roles, and it outperforms other state-of-the-art methods under various metrics.

Effectively leveraging Multi-modal Features for Movie Genre Classification

This work proposes a multi-modal approach leveraging shot information, MMShot, to classify video genres in an efficient and effective way, and demonstrates the approach's ability to generalize by evaluating it on the scene boundary detection task.

How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

This work uses semi-supervised learning with multiple adverb pseudo-labels to leverage videos with only action labels and gathers adverb annotations for three existing video retrieval datasets, which allows for the new tasks of recognizing adverbs in unseen action-adverb compositions and unseen domains.

MaCLR: Motion-aware Contrastive Learning of Representations for Videos

The MaCLR representation can be as effective as representations learned with full supervision on UCF101 and HMDB51 action recognition, and even outperforms the supervised representation for action recognition on VidSitu and SSv2, and for action detection on AVA.

New Frontiers of Information Extraction

This tutorial will provide the audience with a systematic introduction to recent advances in IE, answering several important research questions, including how to develop a robust IE system from noisy, insufficient training data while ensuring the reliability of its predictions.

Dense-Captioning Events in Videos

This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.

Large Scale Holistic Video Understanding

A new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) is introduced, which fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues.

Visual Semantic Role Labeling

The problem of Visual Semantic Role Labeling is introduced: given an image, the goal is to detect people performing actions, localize the objects of interaction, and associate objects in the scene with the different semantic roles of each action.

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

Condensed Movies: Story Based Retrieval with Contextual Embeddings

The Condensed Movie Dataset (CMD) is created, consisting of the key scenes from over 3K movies: each key scene is accompanied by a high level semantic description of the scene, character face tracks, and metadata about the movie.

MovieNet: A Holistic Dataset for Movie Understanding

MovieNet is the largest dataset with the richest annotations for comprehensive movie understanding, and it is believed that such a holistic dataset will promote research on story-based long video understanding and beyond.

Knowledge Graph Extraction from Videos

This paper proposes the new task of knowledge graph extraction from videos, i.e., producing a description of a given video's contents in the form of a knowledge graph, and includes a method to automatically generate such graphs from datasets where videos are annotated with natural language.

Grounding Semantic Roles in Images

This work renders candidate participants as image regions of objects, and trains a model which learns to ground roles in the regions which depict the corresponding participant, and induces frame-semantic visual representations.

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.

Localizing Moments in Video with Natural Language

The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.