Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

@article{Yeung2017EveryMC,
  title={Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos},
  author={Serena Yeung and Olga Russakovsky and Ning Jin and Mykhaylo Andriluka and Greg Mori and Li Fei-Fei},
  journal={International Journal of Computer Vision},
  year={2017},
  volume={126},
  pages={375--389}
}
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long…
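
A minimal sketch of what dense multi-label frame labeling involves, as a point of reference for the abstract above. This is an illustrative baseline, not the paper's actual model (its LSTM variant is truncated above); the module names and sizes are hypothetical, with 65 output classes matching the MultiTHUMOS label set:

```python
# Hypothetical baseline: an LSTM over per-frame CNN features with
# independent per-class sigmoid outputs, so multiple actions can be
# active in the same frame. Not the paper's architecture.
import torch
import torch.nn as nn

class DenseActionLabeler(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=65):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):            # (B, T, feat_dim) CNN features
        h, _ = self.lstm(frame_feats)          # (B, T, hidden)
        return self.head(h)                    # per-frame logits (B, T, K)

model = DenseActionLabeler()
feats = torch.randn(2, 100, 2048)              # 2 clips, 100 frames each
logits = model(feats)
targets = torch.randint(0, 2, logits.shape).float()   # multi-hot frame labels
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

The departure from single-label recognition is the sigmoid/BCE head: each class is scored independently at every frame, so co-occurring actions can all be labeled at once.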
Citations

Instance-Aware Detailed Action Labeling in Videos
TLDR
This work proposes an instance-aware sequence labeling method that utilizes cues from action instance detection and designs an LSTM-based fusion network that integrates framewise action labeling and action instance prediction to produce a final consistent labeling.
Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
TLDR
This work presents the Multi-Moments in Time dataset (M-MiT), which includes over two million action labels for over one million three-second videos, and introduces novel challenges on how to train and analyze models for multi-action detection.
Structured Learning for Action Recognition in Videos
TLDR
A novel architecture is proposed, consisting of a correlation learning and input synthesis network, a long short-term memory (LSTM), and a hierarchical classifier, which utilizes the simultaneous occurrence of general actions such as running and jumping to refine the prediction of their correlated actions.
Relational Action Forecasting
TLDR
The approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes, and learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling the method to tackle challenging visual data.
TAN: Temporal Aggregation Network for Dense Multi-Label Action Recognition
TLDR
Experiments show that the TAN model is well suited for dense multi-label action recognition, a challenging sub-topic of action recognition that requires predicting multiple action labels in each frame.
Human Action Recognition Based on Selected Spatio-Temporal Features via Bidirectional LSTM
TLDR
This paper proposes a novel framework that can select discriminative parts in the spatial dimension and enrich the modeling of motion in the temporal dimension using multiple layers of a long short-term memory (LSTM) network, which can learn compositional representations in space and time.
Asynchronous Temporal Fields for Action Recognition
TLDR
This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos
TLDR
A novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data.
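
To make the CDC summary concrete: 3D ConvNets shrink the temporal axis, so precise localization requires upsampling back to per-frame resolution. The toy sketch below shows the conv-then-deconv pattern in time only; the layer shapes are hypothetical and this is not the paper's actual CDC filter design:

```python
# Toy conv-then-deconv in time: a 3D conv downsamples T by 8x, spatial
# pooling collapses H and W, and a temporal transposed conv restores
# framewise resolution for per-frame class scores.
import torch
import torch.nn as nn

T, K = 32, 20                                  # input frames, action classes
encoder = nn.Conv3d(3, 64, kernel_size=(8, 7, 7), stride=(8, 2, 2), padding=(0, 3, 3))
pool = nn.AdaptiveAvgPool3d((None, 1, 1))      # collapse space, keep time
decoder = nn.ConvTranspose1d(64, K, kernel_size=8, stride=8)  # upsample time 8x

clip = torch.randn(1, 3, T, 112, 112)          # (B, C, T, H, W)
z = pool(encoder(clip)).squeeze(-1).squeeze(-1)  # (B, 64, T/8)
scores = decoder(z)                            # (B, K, T) framewise class scores
assert scores.shape[-1] == T
```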
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
TLDR
The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.
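
For orientation, the dynamic-programming alignment that ECTC extends is standard CTC, which PyTorch exposes directly. The sketch below shows plain CTC given only an ordered action list per video (weak supervision, no frame boundaries); ECTC's frame-similarity consistency term is the paper's contribution and is not reproduced here:

```python
# Plain CTC: marginalizes over all monotonic frame-to-label alignments.
import torch
import torch.nn as nn

T, B, K = 50, 1, 10                            # frames, batch, classes (0 = blank)
log_probs = torch.randn(T, B, K).log_softmax(2)
targets = torch.tensor([[3, 7, 2]])            # ordered actions for the video
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([3])
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```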
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
TLDR
The graph, including its nodes and edges, is learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.

References

Showing 1-10 of 63 references
Parsing Videos of Actions with Segmental Grammars
TLDR
This work describes simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state machine, which makes parsing linear-time, constant-storage, and naturally online.
Finding action tubes
TLDR
This work addresses the problem of action detection in videos using rich feature hierarchies derived from shape and kinematic cues and extracts spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks.
Describing Videos by Exploiting Temporal Structure
TLDR
This work proposes an approach that takes into account both the local and global temporal structure of videos to produce descriptions, including a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the state of the text-generating RNN.
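
The temporal attention mechanism in the entry above follows a generic pattern: score each temporal segment against the text-generating RNN's current state, softmax over time, and attend to the weighted sum. A minimal sketch with hypothetical names and dimensions, not necessarily the paper's exact scoring function:

```python
# Generic soft temporal attention: decoder state selects relevant segments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim=1024, state_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + state_dim, 1)

    def forward(self, seg_feats, dec_state):   # (B, T, F), (B, S)
        B, T, _ = seg_feats.shape
        s = dec_state.unsqueeze(1).expand(B, T, -1)
        e = self.score(torch.cat([seg_feats, s], dim=-1)).squeeze(-1)  # (B, T)
        a = F.softmax(e, dim=-1)               # attention weights over segments
        return (a.unsqueeze(-1) * seg_feats).sum(1)  # context vector (B, F)

ctx = TemporalAttention()(torch.randn(2, 8, 1024), torch.randn(2, 512))
```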
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
TLDR
This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on the new dataset using a standard bag-of-words approach, with overall performance of 44.5%.
Learning latent temporal structure for complex event detection
TLDR
This work utilizes a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.
Video Description Generation Incorporating Spatio-Temporal Features and a Soft-Attention Mechanism
TLDR
This work applies a long short-term memory (LSTM) network to video description in two configurations, with and without a recently introduced soft-attention mechanism, and finds that incorporating soft attention into the text-generating RNN significantly improves the quality of the descriptions.
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
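
The two-stream recipe is compact enough to sketch end to end: a spatial net on a single RGB frame, a temporal net on a stack of optical-flow fields (two channels per flow frame), and late fusion of class scores. The tiny backbone below is a deliberate stand-in, not the original architecture; 101 classes matches UCF101:

```python
# Two-stream late fusion with a toy CNN backbone.
import torch
import torch.nn as nn

def tiny_cnn(in_ch, num_classes=101):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 7, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

L = 10                                         # optical-flow frames stacked
spatial = tiny_cnn(3)                          # appearance stream (RGB frame)
temporal = tiny_cnn(2 * L)                     # motion stream (x/y flow stack)

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 2 * L, 224, 224)
fused = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2  # late fusion
```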
Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models
TLDR
This paper proposes a discriminative semi-Markov model approach and defines a set of features over boundary frames, segments, and neighboring segments, enabling it to conveniently capture the combination of local and global features that best represents each specific action type.
Unsupervised Learning of Video Representations using LSTMs
TLDR
This work uses Long Short-Term Memory (LSTM) networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.
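
The representations above are learned with sequence-to-sequence objectives; a reconstruction-only LSTM autoencoder conveys the core idea. Names and dimensions here are illustrative, and the original work also studied composite decoders (e.g., future prediction) not shown:

```python
# LSTM autoencoder sketch: encode a feature sequence into the final LSTM
# state, then decode the sequence back from that state alone.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.dec = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):                      # (B, T, feat_dim)
        _, state = self.enc(x)                 # summary of the whole sequence
        zeros = torch.zeros_like(x)            # decoder driven by state alone
        h, _ = self.dec(zeros, state)
        return self.out(h)                     # reconstruction (B, T, feat_dim)

x = torch.randn(2, 16, 1024)
recon = SeqAutoencoder()(x)
loss = nn.MSELoss()(recon, x)                  # train to reconstruct the input
```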
Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities
  M. Ryoo, J. Aggarwal · 2009 IEEE 12th International Conference on Computer Vision · 2009
TLDR
This paper proposes a novel matching scheme, the spatio-temporal relationship match, designed to measure structural similarity between sets of features extracted from two videos, thereby enabling detection and localization of complex non-periodic activities.