Corpus ID: 207780280

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

@article{Monfort2019MultiMomentsIT,
  title={Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding},
  author={Mathew Monfort and Kandan Ramakrishnan and Alex Andonian and Barry A. McNamara and Alex Lascelles and Bowen Pan and Quanfu Fan and Dan Gutfreund and Rog{\'e}rio Schmidt Feris and Aude Oliva},
  journal={ArXiv},
  year={2019},
  volume={abs/1911.00232}
}
An event happening in the world is often made of different activities and actions that can unfold simultaneously or sequentially within a few seconds. However, most large-scale datasets built to train models for action recognition provide a single label per video clip. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled and do not learn the full spectrum of information that would be needed to more completely…
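The core issue the abstract raises is that single-label (softmax) training penalizes actions that are present in a clip but unlabeled. As a rough illustration only, the sketch below shows the multi-label alternative this motivates: a per-class sigmoid with binary cross-entropy in PyTorch. The class count, feature dimension, and linear classifier are placeholder assumptions, not the paper's implementation or loss weighting.

import torch
import torch.nn as nn

# Placeholder sizes, chosen only for illustration.
num_classes = 313        # roughly the scale of the M-MiT label set (illustrative)
batch, feat_dim = 4, 2048

classifier = nn.Linear(feat_dim, num_classes)   # stand-in for a video backbone's head
criterion = nn.BCEWithLogitsLoss()              # independent sigmoid per class

features = torch.randn(batch, feat_dim)         # stand-in for clip-level features
targets = torch.zeros(batch, num_classes)
targets[0, [5, 42]] = 1.0                       # a clip can carry several action labels

logits = classifier(features)
loss = criterion(logits, targets)               # co-occurring actions are not penalized
loss.backward()

Because each class gets its own sigmoid, crediting one action does not suppress the others, which is the behavior the abstract argues single-label training cannot provide.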

Citations

MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
Presents MultiSports, a new multi-person dataset of spatio-temporally localized sports actions with strong diversity, detailed annotation, and high quality, intended as a standard benchmark for spatio-temporal action detection.
A Comprehensive Study of Deep Video Action Recognition
A comprehensive survey of over 200 papers on deep learning for video action recognition, covering early attempts at adapting deep learning, two-stream networks, the adoption of 3D convolutional kernels, and recent compute-efficient models.
Evidential Deep Learning for Open Set Action Recognition
Proposes Deep Evidential Action Recognition (DEAR) for recognizing actions in an open test set, together with a plug-and-play module that debiases the learned representation through contrastive learning to mitigate the static bias of video representations.
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Presents the Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of events, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning, evaluated on video/caption retrieval across multiple datasets.
Video Action Understanding
A tutorial that introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding; it clarifies a taxonomy of action problems, catalogs and highlights video datasets, and formalizes domain-specific metrics for baselining proposed solutions.
Video Action Understanding: A Tutorial
Clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning architectures.
AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
Proposes AR-Net (Adaptive Resolution Network), which selects on the fly the optimal resolution for each frame conditioned on the input, enabling efficient action recognition in long untrimmed videos.
We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
Combines visual features with natural language supervision to generate high-level representations of similarities across a set of videos, allowing the model to perform cognitive tasks such as set abstraction, set completion, and odd-one-out detection.
Cross-Modal Discrete Representation Learning
Presents a self-supervised learning framework that learns a representation capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words.
Audiovisual Classification of Group Emotion Valence Using Activity Recognition Networks
Shows that activity-recognition pretraining offers performance advantages for group-emotion recognition and that audio is essential for improving the accuracy and robustness of video-based recognition.

References

Showing 1-10 of 62 references
Moments in Time Dataset: One Million Videos for Event Understanding
Introduces the Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, posing a new challenge to develop models that scale to the level of complexity and abstract reasoning a human processes daily.
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Defines a novel variant of long short-term memory networks that models temporal relations among dense action labels via multiple input and output connections, improving action labeling accuracy and enabling deeper understanding tasks ranging from structured retrieval to action prediction.
Learning realistic human actions from movies
Presents a new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and shows improved state-of-the-art results on the standard KTH action dataset.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Introduces the Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 (a minimal sketch of the inflation idea follows below).
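To make the inflation idea concrete, here is a hedged sketch in PyTorch: a hypothetical helper (not the I3D code, and the layer sizes are arbitrary) that turns a 2D convolution into a 3D one by repeating its kernel along time and rescaling, which is the bootstrapping strategy the summary refers to.

import torch
import torch.nn as nn

# Hypothetical helper: inflate a 2D conv into a 3D conv by repeating the kernel
# along a new temporal axis and dividing by its length to preserve activation scale.
def inflate_conv(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride), padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)   # arbitrary 2D layer
conv3d = inflate_conv(conv2d)                   # now accepts (B, C, T, H, W) video input
video = torch.randn(1, 3, 16, 112, 112)
out = conv3d(video)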
Pulling Actions out of Context: Explicit Separation for Effective Combination
  Y. Wang, Minh Hoai · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
Proposes an approach for training a human action recognizer that explicitly factorizes human actions from co-occurring factors, deliberately building one model for human actions and a separate model for all correlated contextual elements.
Two-Stream Convolutional Networks for Action Recognition in Videos
Proposes a two-stream ConvNet architecture that incorporates spatial and temporal networks, and demonstrates that a ConvNet trained on multi-frame dense optical flow achieves very good performance despite limited training data (a minimal sketch follows below).
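As a concrete illustration of the two-stream idea summarized above, here is a minimal, self-contained sketch with toy networks (not the original architecture): an RGB stream, a stacked-optical-flow stream, and late fusion by averaging class scores. The layer sizes, class count, and flow stack length are placeholder assumptions.

import torch
import torch.nn as nn

class TinyStream(nn.Module):
    # Toy per-stream CNN; the original work used much deeper networks.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

num_classes = 101                         # e.g. UCF-101
spatial = TinyStream(3, num_classes)      # single RGB frame
temporal = TinyStream(20, num_classes)    # 10 optical-flow frames x 2 channels, stacked

rgb = torch.randn(2, 3, 224, 224)
flow = torch.randn(2, 20, 224, 224)

# Late fusion: average the per-stream softmax scores.
scores = (spatial(rgb).softmax(dim=-1) + temporal(flow).softmax(dim=-1)) / 2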
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  C. Gu, Chen Sun, +8 authors, J. Malik · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, with actions localized in space and time, resulting in 1.59M action labels and multiple labels per person occurring frequently.
Temporal Relational Reasoning in Videos
Introduces an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
ActivityNet: A large-scale video benchmark for human activity understanding
Introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims to cover a wide range of complex human activities of interest to people in their daily lives.
The Kinetics Human Action Video Dataset
Describes the Kinetics dataset and its statistics, how it was collected, and baseline performance figures for neural network architectures trained and tested on it for human action classification.