What Actions are Needed for Understanding Human Actions in Videos?

@inproceedings{Sigurdsson2017WhatAA,
  title={What Actions are Needed for Understanding Human Actions in Videos?},
  author={Gunnar A. Sigurdsson and Olga Russakovsky and Abhinav Gupta},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={2156-2165}
}
What is the right way to reason about human activities? What directions forward are most promising? In this work, we analyze the current state of human activity understanding in videos. The goal of this paper is to examine datasets, evaluation metrics, algorithms, and potential future directions. We look at the qualitative attributes that define activities such as pose variability, brevity, and density. The experiments consider multiple state-of-the-art algorithms and multiple datasets. The… 

Citations

Diagnosing Error in Temporal Action Detectors
TLDR
A new diagnostic tool is introduced to analyze the performance of temporal action detectors in videos and to compare methods beyond a single scalar metric, finding that the lack of agreement among annotators is not a major roadblock to progress in the field.
Am I Done? Predicting Action Progress in Videos
TLDR
A novel approach, named ProgressNet, is introduced that predicts when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution, based on a combination of the Faster R-CNN framework and LSTM networks.
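To make the idea concrete, here is a minimal sketch of progress regression over precomputed per-frame features; the actual ProgressNet couples Faster R-CNN region proposals with stacked LSTMs, so the single-LSTM model, the ProgressRegressor name, and the dimensions below are illustrative simplifications, not the authors' implementation.

# Hypothetical sketch: action-progress regression over precomputed per-frame features.
import torch
import torch.nn as nn

class ProgressRegressor(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h, _ = self.lstm(feats)                # temporal context per frame
        return torch.sigmoid(self.head(h))     # progress in [0, 1] at every frame

model = ProgressRegressor()
clip = torch.randn(2, 30, 2048)                # 2 clips, 30 frames each
progress = model(clip)                         # (2, 30, 1)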
Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
TLDR
This work proposes Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions in video, and puts forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets.
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
TLDR
This work builds a video dataset with fully observable and controllable object and scene bias that truly requires spatiotemporal understanding to be solved, and provides insights into some of the most recent state-of-the-art deep video architectures.
Weakly Supervised Gaussian Networks for Action Detection
TLDR
A novel method, called WSGN, is proposed that learns to detect actions from weak supervision using only video-level labels, leading to significant gains in action detection on two standard benchmarks, THUMOS14 and Charades.
Temporal Relational Reasoning in Videos
TLDR
This paper introduces an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
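As a concrete illustration, below is a minimal sketch of a multi-scale temporal relation module in the spirit of TRN, assuming per-frame features have already been extracted; the TemporalRelation class, the tuple-sampling scheme, and the MLP sizes are assumptions for illustration, not the authors' released implementation.

# Illustrative multi-scale temporal relation module (TRN-style), not the official code.
import itertools
import random
import torch
import torch.nn as nn

class TemporalRelation(nn.Module):
    def __init__(self, feat_dim=256, num_classes=174, scales=(2, 3, 4), tuples_per_scale=3):
        super().__init__()
        self.scales = scales
        self.tuples_per_scale = tuples_per_scale
        # One relation MLP per scale k, applied to k concatenated frame features.
        self.g = nn.ModuleList(
            nn.Sequential(nn.Linear(k * feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
            for k in scales
        )

    def forward(self, frames):                           # frames: (batch, time, feat_dim)
        b, t, d = frames.shape
        logits = 0
        for g_k, k in zip(self.g, self.scales):
            # Sample a few ordered k-frame tuples and average their relation scores.
            tuples = random.sample(list(itertools.combinations(range(t), k)), self.tuples_per_scale)
            for idx in tuples:
                x = frames[:, list(idx), :].reshape(b, k * d)
                logits = logits + g_k(x) / (len(self.scales) * self.tuples_per_scale)
        return logits

model = TemporalRelation()
scores = model(torch.randn(2, 8, 256))                   # (2, num_classes)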
STAR: A Benchmark for Situated Reasoning in Real-World Videos
TLDR
A new benchmark, Situated Reasoning in Real-World Videos (STAR), is presented, together with a diagnostic neuro-symbolic model that disentangles visual perception, situation abstraction, language understanding, and functional reasoning to understand the benchmark's challenges.
Temporal Relevance Analysis for Video Action Models
TLDR
A new approach, based on layer-wise relevance propagation, is proposed to quantify the temporal relationships between frames captured by CNN-based action models, showing that there is no strong correlation between temporal relevance and model performance.
Human Action Sequence Classification
TLDR
This paper classifies human action sequences from videos using a machine translation model trained to output action sequences, which are then used to solve downstream tasks such as video captioning and action localization.

References

Showing 1-10 of 44 references
Learning realistic human actions from movies
TLDR
A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
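To illustrate the multi-channel non-linear SVM component, here is a minimal sketch that combines exponential chi-square kernels over per-channel bag-of-features histograms (e.g. HOG and HOF) and trains a precomputed-kernel SVM; the toy histogram data, channel names, and gamma value are placeholders rather than the paper's setup.

# Illustrative multi-channel chi-square kernel SVM over bag-of-features histograms.
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(A, B, gamma=1.0):
    """Exponential chi-square kernel between rows of A and B (L1-normalized histograms)."""
    num = (A[:, None, :] - B[None, :, :]) ** 2
    den = A[:, None, :] + B[None, :, :] + 1e-10
    dist = 0.5 * (num / den).sum(-1)
    return np.exp(-gamma * dist)

def multichannel_kernel(channels_a, channels_b):
    # Average the per-channel kernels (e.g. a HOG channel and a HOF channel).
    return np.mean([chi2_kernel(a, b) for a, b in zip(channels_a, channels_b)], axis=0)

# Toy data: 20 videos, two 100-bin histogram channels, binary labels.
rng = np.random.default_rng(0)
hog = rng.dirichlet(np.ones(100), size=20)
hof = rng.dirichlet(np.ones(100), size=20)
y = rng.integers(0, 2, size=20)

K_train = multichannel_kernel([hog, hof], [hog, hof])
clf = SVC(kernel="precomputed").fit(K_train, y)
print(clf.predict(K_train)[:5])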
Asynchronous Temporal Fields for Action Recognition
TLDR
This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities, including objects, actions, and intentions, where the potentials are predicted by a deep network.
A combined pose, object, and feature model for action understanding
TLDR
This work presents a system that recognizes complex, fine-grained human actions involving the manipulation of objects in realistic action sequences by combining pose, object, and feature cues in a single model that outperforms existing state-of-the-art techniques on the evaluated dataset.
Recognizing realistic actions from videos “in the wild”
TLDR
This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”: motion statistics are used to acquire stable motion features and clean static features, and PageRank is used to mine the most informative static features.
ActivityNet: A large-scale video benchmark for human activity understanding
TLDR
This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.
Detecting activities of daily living in first-person camera views
TLDR
This work presents a novel dataset and novel algorithms for detecting activities of daily living in first-person camera views, and develops representations, including temporal pyramids and composite object models, that exploit the fact that objects look different when being interacted with.
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
TLDR
A novel variant of long short-term memory networks is defined for modeling temporal relations among actions via multiple input and output connections, and this model is shown to improve action labeling accuracy and to enable deeper understanding tasks ranging from structured retrieval to action prediction.
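A minimal sketch of one way to realize an LSTM with multiple input and output connections, assuming per-frame features: the multiple inputs are an attention-pooled window of frames feeding each step, and the outputs are dense per-frame multi-label scores. The module name, window size, and attention form are assumptions for illustration rather than the paper's exact architecture.

# Illustrative LSTM with windowed (multi-frame) inputs and dense multi-label outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiInputOutputLSTM(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=157, window=5):
        super().__init__()
        self.window = window
        self.attn = nn.Linear(feat_dim, 1)                # soft attention over the input window
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)      # per-step multi-label scores

    def forward(self, feats):                              # feats: (batch, time, feat_dim)
        b, t, d = feats.shape
        pooled = []
        for i in range(t):
            win = feats[:, max(0, i - self.window + 1): i + 1, :]   # frames feeding step i
            w = F.softmax(self.attn(win), dim=1)                    # attention over the window
            pooled.append((w * win).sum(dim=1))
        h, _ = self.lstm(torch.stack(pooled, dim=1))
        return self.out(h)                                 # (batch, time, num_classes) logits

model = MultiInputOutputLSTM()
logits = model(torch.randn(2, 40, 1024))                   # dense per-frame label scores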
HMDB: A large video database for human motion recognition
TLDR
This paper introduces the largest action video database to date, with 51 action categories containing around 7,000 manually annotated clips extracted from sources ranging from digitized movies to YouTube, and uses it to evaluate two representative computer vision systems for action recognition and to explore their robustness under various conditions.
Machine Recognition of Human Activities: A Survey
TLDR
A comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications is presented.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
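To sketch the aggregation step, below is a minimal VLAD-style pooling of local convolutional features across the spatio-temporal extent of a video, following the common NetVLAD-style soft-assignment recipe; the VLADPool class, cluster count, and normalization details are illustrative assumptions, not the authors' released code.

# Illustrative VLAD-style spatio-temporal aggregation of local conv features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPool(nn.Module):
    def __init__(self, feat_dim=512, num_clusters=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.01)
        self.assign = nn.Linear(feat_dim, num_clusters)     # soft assignment to clusters

    def forward(self, feats):                               # feats: (batch, N, feat_dim),
        a = F.softmax(self.assign(feats), dim=-1)           # N = T*H*W local descriptors
        # Residual of every descriptor to every cluster center, weighted by its assignment.
        resid = feats.unsqueeze(2) - self.centers            # (batch, N, K, D)
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)          # (batch, K, D)
        vlad = F.normalize(vlad, dim=-1)                      # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)           # final L2-normalized descriptor

pool = VLADPool()
local = torch.randn(2, 10 * 7 * 7, 512)                       # 10 frames of 7x7 conv features
video_desc = pool(local)                                       # (2, 64 * 512)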