Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

  • Mathew Monfort, Kandan Ramakrishnan, Alex Andonian, Barry A. McNamara, Alex Lascelles, Bowen Pan, Quanfu Fan, Dan Gutfreund, Rogério Schmidt Feris, Aude Oliva
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
Videos capture events that typically contain multiple sequential and simultaneous actions, even in the span of only a few seconds. However, most large-scale datasets built to train models for action recognition in video provide only a single label per video. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled, and they do not learn the full spectrum of information present in each video during training. To address this, we…
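The move from single-label to multi-label video supervision described above typically amounts to replacing softmax cross-entropy with per-class binary cross-entropy, so a clip can be positive for several actions at once and unlabeled co-occurring actions are not actively suppressed. A minimal numpy sketch (the function and class names are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_label_bce(logits, targets):
    """Per-class binary cross-entropy: each action is an independent
    yes/no decision, so several actions can be positive for one clip."""
    p = sigmoid(logits)
    eps = 1e-12
    losses = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return losses.mean()

# One short clip labeled with two simultaneous actions.
logits = np.array([3.0, 2.5, -4.0])    # scores for [running, jumping, cooking]
targets = np.array([1.0, 1.0, 0.0])    # both "running" and "jumping" are present
loss = multi_label_bce(logits, targets)
```

With a softmax loss the two positive classes would compete for probability mass; with per-class BCE they are scored independently.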

Multi-Modal Multi-Action Video Recognition

This paper proposes a novel multi-action relation model for videos that leverages both relational graph convolutional networks (GCNs) and video multi-modality, and achieves state-of-the-art performance on the large-scale multi-action M-MiT benchmark.

A Comprehensive Study of Deep Video Action Recognition

A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.

MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions

This paper presents a new multi-person dataset of spatio-temporally localized sports actions, coined MultiSports, which the authors hope can serve as a standard benchmark for spatio-temporal action detection in the future.

GabriellaV2: Towards better generalization in surveillance videos for Action Detection

This work proposes a real-time, online action detection system that generalizes robustly to unseen facility surveillance videos; it achieves state-of-the-art performance on the ActEV-SDL UF-full dataset and took second place in the TRECVID 2021 ActEV challenge.

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events, is presented, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning; the resulting models are evaluated on video/caption retrieval across multiple datasets.

Evidential Deep Learning for Open Set Action Recognition

  • Wentao Bao, Qi Yu, Yu Kong
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
A Deep Evidential Action Recognition (DEAR) method is proposed to recognize actions in an open testing set, along with a plug-and-play module that debiases the learned representation through contrastive learning to mitigate the static bias of video representations.
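Evidential deep learning of the kind DEAR builds on treats the network's non-negative outputs as evidence for a Dirichlet distribution over classes, from which a closed-form uncertainty falls out; open-set inputs should yield low evidence and hence high uncertainty. A small sketch of that standard subjective-logic reading (numbers are illustrative, not from the paper):

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Map per-class evidence e_k >= 0 to Dirichlet parameters
    alpha_k = e_k + 1; belief b_k = e_k / S and vacuity u = K / S,
    where S = sum(alpha) is the Dirichlet strength."""
    alpha = evidence + 1.0
    strength = alpha.sum()
    k = evidence.shape[0]
    belief = evidence / strength
    uncertainty = k / strength
    return belief, uncertainty

# Confident in-distribution action: strong evidence for one class.
_, u_known = dirichlet_uncertainty(np.array([40.0, 1.0, 1.0]))
# Open-set input: almost no evidence for any known class.
_, u_unknown = dirichlet_uncertainty(np.array([0.2, 0.1, 0.3]))
```

Thresholding the vacuity `u` is then a natural way to flag unknown actions at test time.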

Video Action Understanding: A Tutorial

This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.

AR-Net: Adaptive Frame Resolution for Efficient Action Recognition

A novel approach, called AR-Net (Adaptive Resolution Network), is proposed that selects on-the-fly the optimal resolution for each frame, conditioned on the input, for efficient action recognition in long untrimmed videos.
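The core mechanism is a lightweight policy that scores a fixed set of candidate resolutions per frame; at inference the highest-scoring one is taken (training keeps the choice differentiable, e.g. via Gumbel-softmax). A toy sketch under those assumptions, with an illustrative resolution set:

```python
import numpy as np

def select_resolutions(policy_logits, resolutions=(224, 168, 112, 84)):
    """Per-frame resolution choice: the policy emits logits over the
    candidate resolutions for each frame; take the argmax at inference."""
    choices = np.argmax(policy_logits, axis=1)
    return [resolutions[c] for c in choices]

# 4 frames x 4 candidate resolutions; uninformative frames
# should be routed to cheap low-resolution processing.
logits = np.array([
    [0.1, 0.2, 0.3, 2.0],   # near-static frame -> smallest resolution
    [2.0, 0.1, 0.0, 0.0],   # informative frame -> full resolution
    [0.0, 1.5, 0.2, 0.1],
    [0.3, 0.1, 1.2, 0.4],
])
chosen = select_resolutions(logits)
```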

Prompting Visual-Language Models for Efficient Video Understanding

This paper proposes to optimise a few random vectors, termed “continuous prompt vectors”, that convert video-related tasks into the same format as the pre-training objectives, exploiting the pre-trained visual-language model for resource-hungry video understanding tasks with minimal training.
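Continuous prompting of this kind keeps the pre-trained encoder and token embeddings frozen and learns only a handful of vectors that are placed around the class-name embedding, so the downstream task looks like the pre-training objective. A minimal numpy sketch (shapes and the split of prompts before/after the class name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8
num_prompts = 4

# Learnable continuous prompt vectors: the only trained parameters.
prompt_vectors = rng.normal(size=(num_prompts, embed_dim))

def build_encoder_input(class_token_embeddings, prompts):
    """Wrap the frozen class-name token embeddings with prompt
    vectors so the input matches the pre-training format."""
    return np.concatenate(
        [prompts[:2], class_token_embeddings, prompts[2:]], axis=0
    )

# Frozen embedding of a 3-token class name, e.g. "playing the guitar".
class_name = rng.normal(size=(3, embed_dim))
encoder_input = build_encoder_input(class_name, prompt_vectors)
```

Only `prompt_vectors` would receive gradients; everything else stays fixed, which is what makes the adaptation cheap.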

Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection

This work proposes an approach for human action detection in videos that transfers knowledge from the source domain to the target domain using mixed sampling and pseudo-label-based self-training and demonstrates that DA-AIM consistently outperforms prior works on challenging domain adaptation benchmarks.

Moments in Time Dataset: One Million Videos for Event Understanding

The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

Learning realistic human actions from movies

A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

A novel variant of long short-term memory (LSTM) deep networks is defined for modeling temporal relations among actions via multiple input and output connections, and it is shown that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
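The two-stream design combines an appearance network over RGB frames with a motion network over stacked optical flow, typically fused late by averaging the per-stream class scores. A toy numpy sketch of that late fusion (logits and the fusion weight are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_predict(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion of the spatial (RGB) and temporal (optical-flow)
    streams: a weighted average of the per-stream class posteriors."""
    return (1 - flow_weight) * softmax(rgb_logits) + flow_weight * softmax(flow_logits)

rgb = np.array([2.0, 0.5, -1.0])    # appearance cues favor class 0
flow = np.array([0.1, 3.0, -0.5])   # motion cues strongly favor class 1
fused = two_stream_predict(rgb, flow)
pred = int(np.argmax(fused))
```

Here the confident motion stream overrides the weaker appearance evidence, which is exactly the behavior the two-stream architecture is after.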

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics; a new Two-Stream Inflated 3D ConvNet based on 2D ConvNet inflation is introduced.
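The inflation trick behind I3D bootstraps 3D filters from pre-trained 2D ones: each 2D kernel is repeated along a new temporal axis and rescaled by the temporal depth, so the inflated network reproduces the 2D network's activations on a temporally constant ("boring") video. A single-channel sketch:

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, time_depth):
    """Inflate a pre-trained 2D kernel to 3D by repeating it
    time_depth times along a new leading temporal axis and dividing
    by time_depth, preserving responses on static video."""
    k3d = np.repeat(kernel_2d[np.newaxis, ...], time_depth, axis=0)
    return k3d / time_depth

k2d = np.arange(9, dtype=float).reshape(3, 3)   # toy single-channel kernel
k3d = inflate_2d_kernel(k2d, time_depth=5)

# Sanity check: on a constant-in-time patch the 3D response
# equals the original 2D response.
patch2d = np.ones((3, 3))
patch3d = np.ones((5, 3, 3))
resp2d = (k2d * patch2d).sum()
resp3d = (k3d * patch3d).sum()
```

This equivalence is what lets I3D inherit ImageNet-pre-trained weights before fine-tuning on video.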

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

Pulling Actions out of Context: Explicit Separation for Effective Combination

  • Y. Wang, Minh Hoai
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
A novel approach is presented for training a human action recognizer that can explicitly factorize human actions from co-occurring factors, deliberately building one model for human actions and a separate model for all correlated contextual elements.

ActivityNet: A large-scale video benchmark for human activity understanding

This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.

The Kinetics Human Action Video Dataset

The dataset, its statistics, and how it was collected are described, and baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.

YouTube-8M: A Large-Scale Video Classification Benchmark

YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.