Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

  • Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei
  • International Journal of Computer Vision

Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long…

Instance-Aware Detailed Action Labeling in Videos

This work proposes an instance-aware sequence labeling method that utilizes the cues from action instance detection and designs an LSTM-based fusion network that integrates framewise action labeling and action instance prediction to produce a final consistent labeling.

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

The Multi-Moments in Time dataset (M-MiT) is presented which includes over two million action labels for over one million three second videos and introduces novel challenges on how to train and analyze models for multi-action detection.

Structured Learning for Action Recognition in Videos

A novel architecture consisting of a correlation learning and input synthesis network, long short-term memory (LSTM), and a hierarchical classifier is proposed which utilizes the simultaneous occurrence of general actions such as run and jump to refine the prediction on their correlated actions.

Relational Action Forecasting

The approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes, and learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling the method to tackle challenging visual data.

TAN: Temporal Aggregation Network for Dense Multi-Label Action Recognition

Experiments show that the TAN model is well suited for dense multi-label action recognition, a challenging sub-topic of action recognition that requires predicting multiple action labels in each frame.

Human Action Recognition Based on Selected Spatio-Temporal Features via Bidirectional LSTM

This paper proposes a novel framework that selects the discriminative part in the spatial dimension and enriches the modeling of motion in the temporal dimension using multiple layers of long short-term memory, which can learn compositional representations in space and time.

Asynchronous Temporal Fields for Action Recognition

This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos

A novel Convolutional-De-Convolutional (CDC) network is proposed that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data.

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.
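
The alignment sum that ECTC computes can be illustrated with a small dynamic program. The sketch below is a generic, blank-free CTC-style forward pass (the function name and interface are illustrative assumptions; ECTC proper additionally weights alignments by frame-to-frame visual similarity, which is omitted here):

```python
from typing import List

def alignment_score(probs: List[List[float]], labels: List[int]) -> float:
    """Sum the probabilities of every monotonic alignment of T frames
    to an ordered label sequence, where each label must cover at least
    one contiguous run of frames (no blank symbol)."""
    T, L = len(probs), len(labels)
    if L == 0 or L > T:
        return 0.0
    # alpha[l] holds, after processing frame t, the total probability
    # of alignments of frames 0..t that currently end in label l.
    alpha = [0.0] * L
    alpha[0] = probs[0][labels[0]]
    for t in range(1, T):
        # Update from high l to low l so alpha[l-1] still holds frame t-1's value.
        for l in range(min(t, L - 1), -1, -1):
            stay = alpha[l]                            # remain in label l
            advance = alpha[l - 1] if l > 0 else 0.0   # move on from label l-1
            alpha[l] = probs[t][labels[l]] * (stay + advance)
    return alpha[L - 1]

# With uniform framewise probabilities, the two possible alignments of
# three frames to the sequence [0, 1] each have probability 0.125.
uniform = [[0.5, 0.5]] * 3
print(alignment_score(uniform, [0, 1]))  # 0.25
```

The recurrence visits each (frame, label) cell once, so the full sum over exponentially many alignments costs only O(T·L); ECTC keeps this structure while down-weighting paths that conflict with visual similarity between adjacent frames.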

VideoGraph: Recognizing Minutes-Long Human Activities in Videos

The graph, its nodes and edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.

Parsing Videos of Actions with Segmental Grammars

This work describes simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state machine, which makes parsing linear-time, constant-storage, and naturally online.

Finding action tubes

This work addresses the problem of action detection in videos using rich feature hierarchies derived from shape and kinematic cues and extracts spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks.

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that takes into account both the local and global temporal structure of videos to produce descriptions, together with a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

This work introduces UCF101, currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with overall performance of 44.5%.

Learning latent temporal structure for complex event detection

A conditional model trained in a max-margin framework is utilized that automatically discovers discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.

Video Description Generation Incorporating Spatio-Temporal Features and a Soft-Attention Mechanism

This work applies a long short-term memory (LSTM) network to video description in two configurations: with a recently introduced soft-attention mechanism, and without, finding that incorporating a soft-attention mechanism into the text-generating RNN significantly improves the quality of the descriptions.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models

This paper proposes a discriminative semi-Markov model approach, and defines a set of features over boundary frames, segments, as well as neighboring segments that enable it to conveniently capture a combination of local and global features that best represent each specific action type.
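
The joint segmentation-and-labeling search behind semi-Markov models can be sketched as an exact dynamic program over segment boundaries. Below is a generic Viterbi-style decoder under a user-supplied segment scoring function (`seg_score` and the `max_len` cap are illustrative assumptions, not the paper's learned discriminative features):

```python
from typing import Callable, List, Tuple

def best_segmentation(
    seg_score: Callable[[int, int, int], float],
    T: int,
    labels: List[int],
    max_len: int,
) -> Tuple[float, List[Tuple[int, int, int]]]:
    """Exact semi-Markov decoding: choose segment boundaries and labels
    that maximize the sum of per-segment scores seg_score(start, end, label)."""
    NEG = float("-inf")
    dp = [NEG] * (T + 1)    # dp[t]: best score covering frames 0..t-1
    back = [None] * (T + 1)
    dp[0] = 0.0
    for t in range(1, T + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length  # candidate segment covers frames s..t-1
            if dp[s] == NEG:
                continue
            for y in labels:
                cand = dp[s] + seg_score(s, t, y)
                if cand > dp[t]:
                    dp[t], back[t] = cand, (s, y)
    # Recover the segmentation as (start, end, label) triples.
    segments, t = [], T
    while t > 0:
        s, y = back[t]
        segments.append((s, t, y))
        t = s
    return dp[T], segments[::-1]
```

Decoding runs in O(T · max_len · |labels|); because scores are attached to whole segments rather than single frames, boundary-level and segment-level features like those in the paper fit naturally into `seg_score`.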

Unsupervised Learning of Video Representations using LSTMs

This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.

Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities

  • M. Ryoo, J. Aggarwal
  • 2009 IEEE 12th International Conference on Computer Vision, 2009
A novel matching scheme, the spatio-temporal relationship match, is designed to measure structural similarity between sets of features extracted from two videos, thereby enabling detection and localization of complex non-periodic activities.