Predicting Human Activities Using Stochastic Grammar

@inproceedings{Qi2017PredictingHA,
  title={Predicting Human Activities Using Stochastic Grammar},
  author={Siyuan Qi and Siyuan Huang and Ping Wei and Song-Chun Zhu},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1173--1181}
}
This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and their environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal…
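As background for the grammar-based approach the abstract describes: a stochastic grammar assigns probabilities to production rules, so a partially observed symbol sequence can be completed by choosing the most probable consistent production. The sketch below is a toy illustration only, not the paper's ST-AOG; the grammar, the sub-activity symbols, and the helper names are all hypothetical.

```python
import random

# Hypothetical toy grammar over sub-activities: each nonterminal maps
# to a list of (production, probability) pairs summing to 1.
GRAMMAR = {
    "Activity": [(("Reach", "Grasp", "Use"), 0.7),
                 (("Reach", "Grasp", "Place"), 0.3)],
}

def sample(symbol, rng=random.Random(0)):
    """Expand a symbol into a terminal sequence by sampling rules."""
    rules = GRAMMAR.get(symbol)
    if rules is None:                     # terminal symbol: emit as-is
        return [symbol]
    r, acc = rng.random(), 0.0
    for production, p in rules:           # roulette-wheel rule selection
        acc += p
        if r <= acc:
            return [t for s in production for t in sample(s, rng)]
    return [t for s in rules[-1][0] for t in sample(s, rng)]

def predict_suffix(observed):
    """Most probable completion of a partially observed sequence."""
    best, best_p = None, 0.0
    for production, p in GRAMMAR["Activity"]:
        if list(production[:len(observed)]) == observed and p > best_p:
            best, best_p = list(production[len(observed):]), p
    return best

print(predict_suffix(["Reach", "Grasp"]))  # ['Use']
```

Having observed "Reach" then "Grasp", the higher-probability production predicts "Use" as the next sub-activity. The paper's actual model additionally conditions such predictions on objects, affordances, and spatial context via the ST-AOG.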

Citations

Learning to Abstract and Predict Human Actions
TLDR
This work proposes Hierarchical Encoder-Refresher-Anticipator, a multi-level neural machine that can learn the structure of human activities by observing a partial hierarchy of events and roll out such structure into a future prediction at multiple levels of abstraction.
A Generalized Earley Parser for Human Activity Parsing and Prediction
TLDR
This paper generalizes the Earley parser to parse sequence data which is neither segmented nor labeled, given the output of an arbitrary probabilistic classifier, and finds the optimal segmentation and labels in the language defined by the input grammar.
Probabilistic Grammar Induction for Long Term Human Activity Parsing
TLDR
The proposed method is interpretable: the representation of an activity can be edited by a human annotator to further improve performance, and the ability of PCFGs to represent human activities is evaluated.
Learning Asynchronous and Sparse Human-Object Interaction in Videos
TLDR
Asynchronous-Sparse Interaction Graph Networks (ASSIGN), a recurrent graph network that is able to automatically detect the structure of interaction events associated with entities in a video scene, is introduced and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
Forecasting Future Sequence of Actions to Complete an Activity
TLDR
This work presents a method to forecast actions for the unseen future of a video using a neural machine translation technique with an encoder-decoder architecture, and proposes a novel loss function to account for two types of uncertainty in the future predictions.
Skeleton-based structured early activity prediction
TLDR
A method is proposed for early prediction of simple and complex human activities, formulated as a structured prediction task using probabilistic graphical models (PGMs), with skeletons captured from low-cost depth sensors serving as high-level descriptions of the human body.
Adversarial Generative Grammars for Human Activity Prediction
TLDR
The proposed adversarial grammar outperforms state-of-the-art approaches, predicting more accurately and further into the future than prior work.
Forecasting Future Action Sequences With Attention: A New Approach to Weakly Supervised Action Forecasting
TLDR
A model is proposed to predict actions in future unseen frames without using frame-level annotations during training; it outperforms prior models by 1.04%, leveraging the proposed weakly supervised architecture and effective use of the attention mechanism and loss functions.
Learning a Generative Model for Multi-Step Human-Object Interactions from Videos
TLDR
A generative model based on a Recurrent Neural Network that learns the causal dependencies and constraints between individual actions and can be used to generate novel and diverse multi-step human-object interactions.
Time-Conditioned Action Anticipation in One Shot
TLDR
Experimental results show that the proposed time-conditioned method is capable of anticipating future actions in both the short term and the long term, and achieves state-of-the-art performance.

References

SHOWING 1-10 OF 43 REFERENCES
Human activity prediction: Early recognition of ongoing activities from streaming videos
  • M. Ryoo
  • Computer Science
    2011 International Conference on Computer Vision
  • 2011
TLDR
A new recognition methodology named dynamic bag-of-words is developed, which considers the sequential nature of human activities while maintaining the advantages of the bag-of-words in handling noisy observations, and reliably recognizes ongoing activities from streaming videos with high accuracy.
Prediction of Human Activity by Discovering Temporal Sequence Patterns
  • Kang Li, Y. Fu
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2014
TLDR
This work proposes a novel framework for long-duration complex activity prediction by discovering three key aspects of activity: Causality, Context-cue, and Predictability, and presents a predictive accumulative function (PAF) to depict the predictability of each kind of activity.
Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video
TLDR
It is argued that a hierarchical, object-oriented design makes the solution scalable, in that higher-level reasoning components are independent of the particular low-level detector implementation, and that recognition of additional activities and actions can easily be added.
Parsing video events with goal inference and intent prediction
TLDR
An event parsing algorithm based on Stochastic Context Sensitive Grammar for understanding events, inferring the goal of agents, and predicting their plausible intended actions achieves the globally optimal parsing solution in a Bayesian framework.
Inferring human intent from video by sampling hierarchical plans
TLDR
A method that allows robots to infer a human's hierarchical intent from partially observed RGBD videos by imagining how the human will behave in the future, using a Bayesian probabilistic programming framework.
Learning human activities and object affordances from RGB-D videos
TLDR
This work considers the problem of extracting a descriptive labeling of the sequence of sub-activities being performed by a human, and more importantly, of their interactions with the objects in the form of associated affordances, and formulate the learning problem using a structural support vector machine (SSVM) approach.
Predicting Actions from Static Scenes
Human actions naturally co-occur with scenes. In this work we aim to discover action-scene correlation for a large number of scene categories and to use such correlation for action prediction.
Anticipating Human Activities Using Object Affordances for Reactive Robotic Response
TLDR
This work represents each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances and represents each ATCRF as a particle and represents the distribution over the potential futures using a set of particles.
Actom sequence models for efficient action detection
TLDR
The model, which represents the temporal structure of actions as a sequence of histograms of actom-anchored visual features, can be seen as a temporally structured extension of bag-of-features and is flexible, sparse, and discriminative.
Activity Forecasting
TLDR
The unified model uses state-of-the-art semantic scene understanding combined with ideas from optimal control theory to achieve accurate activity forecasting and shows how the same techniques can improve the results of tracking algorithms by leveraging information about likely goals and trajectories.