AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

@article{Gu2018AVAAV,
  title={AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions},
  author={Chunhui Gu and Chen Sun and Sudheendra Vijayanarasimhan and Caroline Pantofaru and David A. Ross and George Toderici and Yeqing Li and Susanna Ricco and Rahul Sukthankar and Cordelia Schmid and Jitendra Malik},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={6047-6056}
}
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Published 23 May 2017
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). [...] Key Result: While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.8% mAP, underscoring the need to develop new approaches for video understanding.
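For context, the 15.8% figure is a frame-level mean average precision (mAP): detections are matched to ground-truth boxes at an IoU threshold of 0.5, an average precision is computed per action class, and the per-class values are averaged. A minimal sketch of that computation, in the spirit of the Pascal VOC protocol (the official AVA evaluation tooling differs in details):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def average_precision(detections, ground_truth, thresh=0.5):
    """AP for one action class.

    detections: list of (frame_id, box, score); ground_truth: dict mapping
    frame_id to a list of boxes. mAP averages this value over classes.
    """
    detections = sorted(detections, key=lambda d: -d[2])  # high score first
    matched = {f: [False] * len(bs) for f, bs in ground_truth.items()}
    n_gt = sum(len(bs) for bs in ground_truth.values())
    tp = fp = 0
    precisions = []  # precision recorded at each true positive (recall step)
    for frame_id, box, _ in detections:
        best, best_iou = -1, thresh
        for i, gt_box in enumerate(ground_truth.get(frame_id, [])):
            overlap = iou(box, gt_box)
            if not matched[frame_id][i] and overlap >= best_iou:
                best, best_iou = i, overlap
        if best >= 0:  # hit: an unmatched ground-truth box overlaps enough
            matched[frame_id][best] = True
            tp += 1
            precisions.append(tp / (tp + fp))
        else:
            fp += 1
    return sum(precisions) / n_gt if n_gt else 0.0
```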
Citations

MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
TLDR
This paper presents a new multi-person dataset of spatio-temporally localized sports actions, coined MultiSports, with the important properties of strong diversity, detailed annotation, and high quality, and hopes it can serve as a standard benchmark for spatio-temporal action detection in the future.
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization
TLDR
FineAction, a novel large-scale and fine-grained video dataset, introduces new opportunities and challenges for temporal action localization thanks to its distinct characteristics: fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes.
STAGE: Spatio-Temporal Attention on Graph Entities for Video Action Detection
TLDR
A high-level video understanding module is developed that encodes interactions between actors and objects in both space and time, and that outperforms or matches state-of-the-art models requiring heavy end-to-end, synchronized training on multiple GPUs.
Long term spatio-temporal modeling for action detection
TLDR
A Graph Neural Network is proposed that explicitly models spatial and temporal states for each person instance and learns to effectively combine information from both modalities to make joint predictions, achieving state-of-the-art performance without any fine-tuning.
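The summary doesn't spell out the architecture, but the core idea, keeping separate spatial and temporal states per person and fusing them, can be illustrated with a toy message-passing step. The weight matrices `W_s`, `W_t`, and `W_f` below are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def message_pass(features, spatial_edges, temporal_edges, W_s, W_t, W_f):
    """One toy message-passing step over person instances.

    features: (N, D) per-person embeddings; spatial_edges link people in
    the same frame; temporal_edges link the same person across frames.
    """
    spatial_msg = np.zeros_like(features)
    temporal_msg = np.zeros_like(features)
    for i, j in spatial_edges:    # aggregate spatial context
        spatial_msg[i] += features[j] @ W_s
    for i, j in temporal_edges:   # aggregate temporal context
        temporal_msg[i] += features[j] @ W_t
    # Fuse the two modalities into an updated per-person state.
    fused = np.concatenate([spatial_msg, temporal_msg], axis=1) @ W_f
    return np.tanh(features + fused)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
W_s, W_t = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
W_f = rng.normal(size=(8, 4))
print(message_pass(feats, [(0, 1), (1, 0)], [(0, 2)], W_s, W_t, W_f).shape)  # (3, 4)
```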
Temporal localization of actions in untrimmed videos
Action recognition is the process of identifying actions performed by one or more actors in a given context based on some observations. Actions come in all shapes and sizes, be it a simple action [...]
Moments in Time Dataset: One Million Videos for Event Understanding
TLDR
The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
Spatio-Temporal Action Detection with Multi-Object Interaction
TLDR
This paper introduces a new dataset that is annotated with action tubes containing multi-object interactions, and proposes an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
  • Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
TLDR
Compared to prior work that performs action detection in one run, the proposed Spatio-TEmporal Progressive action detector is able to naturally handle the spatial displacement within action tubes and therefore provides a more effective way for spatio-temporal modeling.
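The progressive idea can be made concrete with a toy refinement loop (an illustration of the concept, not the STEP implementation): a rigid cuboid proposal cannot follow an actor who drifts across frames, but a few steps that adjust each frame's box independently can:

```python
def refine_step(tube, per_frame_targets, lr=0.5):
    """Move each frame's box a fraction of the way toward its regression target."""
    return [tuple(b + lr * (t - b) for b, t in zip(box, target))
            for box, target in zip(tube, per_frame_targets)]

# A cuboid proposal repeats one box across frames; the actor drifts rightward.
cuboid = [(10.0, 10.0, 50.0, 80.0)] * 4
targets = [(10.0 + 5 * k, 10.0, 50.0 + 5 * k, 80.0) for k in range(4)]

tube = cuboid
for _ in range(3):  # progressive steps; in STEP the per-step regressor is learned
    tube = refine_step(tube, targets)
print(tube)  # per-frame boxes now track the displacement instead of staying rigid
```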
SLAC: A Sparsely Labeled Dataset for Action Classification and Localization
TLDR
The proposed procedure dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers, thus generating labels for highly informative samples at little cost.
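The selection criterion above admits a very small sketch. Assuming two pretrained clip classifiers (stand-ins below), the clips worth sending to human annotators are exactly those on which the classifiers disagree:

```python
def hard_clips(clips, classifier_a, classifier_b):
    """Keep clips whose predicted labels disagree: cheap to find, informative to label."""
    return [c for c in clips if classifier_a(c) != classifier_b(c)]

# Toy usage with threshold classifiers over (clip_id, motion_score) pairs.
clips = [("clip1", 0.9), ("clip2", 0.2), ("clip3", 0.55)]
a = lambda c: "run" if c[1] > 0.5 else "sit"
b = lambda c: "run" if c[1] > 0.6 else "sit"
print(hard_clips(clips, a, b))  # only clip3 falls between the thresholds
```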

References

SHOWING 1-10 OF 66 REFERENCES
Spot On: Action Localization from Pointly-Supervised Proposals
TLDR
An overlap measure between action proposals and points is introduced and incorporated into the objective of a non-convex Multiple Instance Learning optimization, and the approach is shown to be competitive with the state of the art.
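The summary doesn't give the paper's exact overlap formula; one plausible form of a point-to-proposal overlap (an illustrative assumption) is the fraction of a frame's annotated points that fall inside the proposal's box:

```python
def point_overlap(box, points):
    """Fraction of annotated (x, y) points inside an (x1, y1, x2, y2) box."""
    if not points:
        return 0.0
    inside = sum(1 for x, y in points
                 if box[0] <= x <= box[2] and box[1] <= y <= box[3])
    return inside / len(points)

print(point_overlap((0, 0, 10, 10), [(5, 5), (20, 3)]))  # 0.5
```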
Moments in Time Dataset: One Million Videos for Event Understanding
TLDR
The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
SLAC: A Sparsely Labeled Dataset for Action Classification and Localization
TLDR
The proposed procedure dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers, thus generating labels for highly informative samples at little cost.
Learning to Track for Spatio-Temporal Action Localization
TLDR
The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features, then tracks high-scoring proposals throughout the video using a tracking-by-detection approach, outperforming the state of the art by margins of 15%, 7%, and 12% in mAP, respectively.
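The tracking-by-detection stage can be sketched as greedy linking (a toy version, not the paper's tracker): seed from the highest-scoring detection anywhere in the video, then in each later frame pick the detection that best trades off overlap with the current box against its own score:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def greedy_track(frames):
    """frames: per-frame lists of (box, score). Returns {frame_index: box}.

    Tracks forward only for brevity; a full tracker would also run
    backward from the seed frame.
    """
    seed = max(range(len(frames)),
               key=lambda f: max((s for _, s in frames[f]), default=0.0))
    box, _ = max(frames[seed], key=lambda d: d[1])
    track = {seed: box}
    for f in range(seed + 1, len(frames)):
        if not frames[f]:
            break
        # Balance overlap with the previous box against the detection's own score.
        box, _ = max(frames[f], key=lambda d: iou(d[0], box) + d[1])
        track[f] = box
    return track

frames = [
    [((0, 0, 10, 10), 0.9)],
    [((2, 0, 12, 10), 0.8), ((40, 40, 50, 50), 0.3)],
    [((4, 0, 14, 10), 0.7)],
]
print(greedy_track(frames))  # follows the drifting, high-overlap boxes
```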
Human Action Localization with Sparse Spatial Supervision
We introduce an approach for spatio-temporal human action localization using sparse spatial supervision. Our method leverages the large amount of annotated humans available today and extracts human [...]
Actions in context
TLDR
This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition, demonstrating improved recognition of both in natural video.
Action Tubelet Detector for Spatio-Temporal Action Localization
TLDR
The proposed ACtion Tubelet detector (ACT-detector) takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores, based on anchor cuboids. It outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.
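The tubelet format described above is simple to write down. A minimal sketch (illustrative, not the ACT-detector code), assuming per-frame box offsets regressed from a shared anchor cuboid:

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Tubelet:
    start_frame: int
    boxes: List[Box]  # one regressed box per frame in the sequence
    score: float      # a single classification score for the whole tubelet

def decode(anchor: Box, offsets: List[Box], start_frame: int, score: float) -> Tubelet:
    """Apply per-frame (dx1, dy1, dx2, dy2) offsets to a shared anchor cuboid."""
    boxes = [tuple(a + d for a, d in zip(anchor, off)) for off in offsets]
    return Tubelet(start_frame, boxes, score)

print(decode((10.0, 10.0, 50.0, 80.0),
             [(0, 0, 0, 0), (2, 0, 2, 0), (4, 0, 4, 0)], 0, 0.87))
```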
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
TLDR
A novel variant of long short-term memory deep networks is defined for modeling these temporal relations via multiple input and output connections, and it is shown that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
Actions as space-time shapes
TLDR
The method is fast, does not require video alignment, and is applicable in many scenarios where the background is known; its robustness to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low-quality video is demonstrated.
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
TLDR
A huge leap forward in action detection performance is achieved, with gains in mAP of 20% and 11% reported on the UCF-101 and J-HMDB-21 datasets, respectively, compared to the state of the art.