Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

@article{Monfort2021MultiMomentsIT,
  title={Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding},
  author={Mathew Monfort and Kandan Ramakrishnan and Alex Andonian and Barry A. McNamara and Alex Lascelles and Bowen Pan and Quanfu Fan and Dan Gutfreund and Rog{\'e}rio Schmidt Feris and Aude Oliva},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021},
  volume={PP}
}
Videos capture events that typically contain multiple sequential and simultaneous actions, even in the span of only a few seconds. However, most large-scale datasets built to train models for action recognition in video provide only a single label per video. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled, and they do not learn the full spectrum of information present in each video during training. Towards this goal, we…
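
The multi-label setting the abstract motivates implies a different training objective from the usual single-label setup: each action class is scored independently (for example with a per-class sigmoid), so a clip can be positive for several co-occurring actions rather than being forced to pick one winner as a single softmax label would. The snippet below is a minimal PyTorch sketch of such a multi-label objective; the class count, feature dimension, and data are placeholders for illustration, not the authors' actual M-MiT model or training code.

import torch
import torch.nn as nn

num_classes = 313                        # placeholder vocabulary size, not necessarily M-MiT's
features = torch.randn(8, 2048)          # stand-in for per-clip features from any video backbone
targets = torch.zeros(8, num_classes)    # multi-hot labels: several actions may be 1 for one clip
targets[0, [3, 17, 42]] = 1.0            # e.g. one clip annotated with three co-occurring actions

classifier = nn.Linear(2048, num_classes)
logits = classifier(features)

# Single-label training would use a softmax cross-entropy over one class per clip.
# Scoring every class with an independent sigmoid lets co-occurring actions all be
# positives instead of competing for a single label.
loss = nn.BCEWithLogitsLoss()(logits, targets)
loss.backward()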

Citations

Multi-Modal Multi-Action Video Recognition
  • Zhensheng Shi, Ju Liang, +4 authors, Bing Zheng
Multi-action video recognition is much more challenging due to the requirement to recognize multiple actions co-occurring simultaneously or sequentially. Modeling multi-action relations is beneficial…
A Comprehensive Study of Deep Video Action Recognition
TLDR: A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally the recent compute-efficient models.
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
TLDR: This paper presents a new multi-person dataset of spatio-temporally localized sports actions, coined MultiSports, with the important properties of strong diversity, detailed annotation, and high quality, and hopes it can serve as a standard benchmark for spatio-temporal action detection in the future.
Video Action Understanding
TLDR: This tutorial introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding; it clarifies a taxonomy of action problems, catalogs and highlights video datasets, and formalizes domain-specific metrics to baseline proposed solutions.
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
  • Mathew Monfort, SouYoung Jin
  • Computer Science, Engineering
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR: The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events, is presented, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning, and the models are evaluated on video/caption retrieval across multiple datasets.
Evidential Deep Learning for Open Set Action Recognition
TLDR: A Deep Evidential Action Recognition (DEAR) method to recognize actions in an open test set, along with a plug-and-play module that debiases the learned representation through contrastive learning to mitigate the static bias of video representations.
Video Action Understanding: A Tutorial
TLDR: This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.
AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
TLDR: A novel approach, called AR-Net (Adaptive Resolution Network), that selects on the fly the optimal resolution for each frame, conditioned on the input, for efficient action recognition in long untrimmed videos.
Cross-Modal Discrete Representation Learning
TLDR: This work presents a self-supervised learning framework that learns a representation capturing finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words.
We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
TLDR: This work combines visual features with natural language supervision to generate high-level representations of similarities across a set of videos, which allows the model to perform cognitive tasks such as set abstraction, set completion, and odd-one-out detection.

References

Showing 1-10 of 62 references
Moments in Time Dataset: One Million Videos for Event Understanding
TLDR: The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
Learning realistic human actions from movies
TLDR: A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
TLDR: A novel variant of long short-term memory deep networks is defined for modeling these temporal relations via multiple input and output connections, and it is shown that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
TLDR: I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet based on 2D ConvNet inflation is introduced.
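
The "inflation" mentioned in this summary refers to bootstrapping 3D filters from a pretrained 2D network by repeating each 2D kernel along the temporal axis and rescaling, so that a temporally constant clip initially produces the same activations the 2D filter gave on a single frame. Below is a rough PyTorch sketch of that weight-copying step, assuming a simple Conv2d/Conv3d pair for illustration rather than the actual I3D implementation.

import torch
import torch.nn as nn

time_dim = 3  # temporal extent chosen for the inflated kernel

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # e.g. a pretrained 2D stem
conv3d = nn.Conv3d(3, 64, kernel_size=(time_dim, 7, 7),
                   stride=(1, 2, 2), padding=(time_dim // 2, 3, 3))

with torch.no_grad():
    # Repeat the 2D kernel along time and divide by the temporal extent so that a
    # temporally constant input yields the same response as the original 2D filter.
    inflated = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.copy_(inflated)
    conv3d.bias.copy_(conv2d.bias)
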
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors, J. Malik
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Pulling Actions out of Context: Explicit Separation for Effective Combination
  • Y. Wang, Minh Hoai
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR: A novel approach for training a human action recognizer that can explicitly factorize human actions from co-occurring factors, deliberately building a model for human actions and a separate model for all correlated contextual elements.
ActivityNet: A large-scale video benchmark for human activity understanding
TLDR: This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.
The Kinetics Human Action Video Dataset
TLDR: The dataset and its statistics are described, along with how it was collected, and some baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.
YouTube-8M: A Large-Scale Video Classification Benchmark
TLDR: YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.