Actions ~ Transformations

@article{Wang2016ActionsT,
  title={Actions ~ Transformations},
  author={X. Wang and Ali Farhadi and Abhinav Kumar Gupta},
  journal={2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016}
}
  • X. Wang, Ali Farhadi, A. Gupta
  • Published 2 December 2015
  • Computer Science
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
What defines an action like "kicking ball"? Motivated by recent advancements in video representation using deep learning, we design a Siamese network that models the action as a transformation on a high-level feature space. We show that our model gives improvements on standard action recognition datasets including UCF101 and HMDB51. More importantly, our approach is able to generalize beyond learned action categories and shows significant performance improvement on cross-category…
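
The paper's central idea, treating an action as a transformation that maps a "precondition" state to an "effect" state in a learned feature space, can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's actual Siamese architecture: the embeddings and the per-class transformation matrices below are random stand-ins for parameters the network would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8          # dimensionality of the high-level feature space
NUM_ACTIONS = 3  # number of action categories

# Per-class linear maps T_c: precondition embedding -> predicted effect.
# Random stand-ins for what a trained Siamese network would provide.
transforms = rng.normal(size=(NUM_ACTIONS, DIM, DIM))

def classify(pre_embed, post_embed):
    """Score each action class by how well its transformation maps the
    precondition embedding onto the observed effect embedding (cosine
    similarity), and return the best-scoring class."""
    scores = []
    for T in transforms:
        pred = T @ pre_embed
        denom = np.linalg.norm(pred) * np.linalg.norm(post_embed) + 1e-8
        scores.append(pred @ post_embed / denom)
    return int(np.argmax(scores))

# Toy check: if the effect embedding really was produced by
# transformation 1, classification recovers class 1.
pre = rng.normal(size=DIM)
post = transforms[1] @ pre
```

The design choice worth noting is that classification happens by comparing transformed embeddings, so the transformation itself, not the appearance of either state, carries the action label.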

Action Recognition Based on Discriminative Embedding of Actions Using Siamese Networks

This paper trains a Siamese deep neural network with a contrastive loss on a low-dimensional representation of a pool of attributes, learned via factor analysis in a universal Gaussian mixture model, to classify actions by leveraging the corresponding class labels.

Am I Done? Predicting Action Progress in Videos

A novel approach is introduced, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution, based on a combination of the Faster R-CNN framework and LSTM networks.

Encouraging LSTMs to Anticipate Actions Very Early

A new action anticipation method that achieves high prediction accuracy even in the presence of a very small percentage of a video sequence, and develops a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduces a novel loss function that encourages the model to predict the correct class as early as possible.
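
The idea of encouraging a model to commit to the correct class early can be illustrated with a time-weighted cross-entropy. This is a generic sketch, not the cited paper's exact loss function; the weighting scheme below is an assumption chosen for simplicity.

```python
import numpy as np

def early_anticipation_loss(probs_over_time, true_class):
    """Time-weighted cross-entropy over the frames of a clip.
    probs_over_time: (T, C) array of per-frame class probabilities.
    The weight w_t = t/T grows with time, but every frame contributes,
    so a model that is already confident in the true class early in
    the clip achieves a strictly lower loss than one that only becomes
    confident near the end."""
    T = probs_over_time.shape[0]
    weights = np.arange(1, T + 1) / T
    log_p = np.log(probs_over_time[:, true_class] + 1e-12)
    return float(-(weights * log_p).mean())

# Two hypothetical models on an 8-frame clip whose true class is 0:
early = np.array([[0.9, 0.1]] * 8)                    # confident throughout
late = np.array([[0.5, 0.5]] * 4 + [[0.9, 0.1]] * 4)  # confident only late
```

Under this loss the "early" model scores lower (better) than the "late" one, which is the behavior anticipation methods in this vein are after.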

Procedural Generation of Videos to Train Deep Action Recognition Networks

This work proposes an interpretable parametric generative model of human action videos that relies on procedural generation and other computer graphics techniques of modern game engines, and generates a diverse, realistic, and physically plausible dataset of human action videos, called PHAV, for Procedural Human Action Videos.

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

A novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes, and that can detect actions and explain how they are executed, mirroring the logical way humans reason.


The objective of this research work is to develop discriminative representations for human actions by combining the advantages of both low-level and high-level features and demonstrates the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation.

Pose from Action: Unsupervised Learning of Pose Features based on Motion

An unsupervised method is proposed to learn pose features from videos by exploiting a signal that is complementary to appearance and can be used as supervision: motion.

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

This paper develops a self-supervised model for jointly learning state-modifying actions together with the corresponding object states from an uncurated set of videos from the Internet, and incorporates a noise adaptive weighting module supervised by a small number of annotated still images.

Joint Discovery of Object States and Manipulation Actions

This work proposes a joint model that learns to identify object states and to localize state-modifying actions and demonstrates successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations.

Long-Term Temporal Convolutions for Action Recognition

It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.

Modeling Actions through State Changes

  • A. Fathi, James M. Rehg
  • Computer Science
    2013 IEEE Conference on Computer Vision and Pattern Recognition
  • 2013
This paper proposes a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions and demonstrates that this method can be used to segment discrete actions from a continuous video of an activity.

Finding action tubes

This work addresses the problem of action detection in videos using rich feature hierarchies derived from shape and kinematic cues and extracts spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks.

Trajectory-Based Modeling of Human Actions with Motion Reference Points

This paper proposes a simple representation, specifically aimed at modeling human actions in videos, that operates on top of visual codewords derived from local patch trajectories and therefore does not require accurate foreground-background separation, which is typically a necessary step when modeling object relationships.

Learning realistic human actions from movies

A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.

Action Recognition by Hierarchical Mid-Level Action Elements

This work introduces an unsupervised method that is capable of distinguishing action-related segments from background segments and representing actions at multiple spatiotemporal resolutions, and develops structured models that capture a rich set of spatial, temporal and hierarchical relations among the segments.

Action Recognition with Actons

A two-layer structure for action recognition that automatically exploits a mid-level "acton" representation via a new max-margin multi-channel multiple-instance learning framework, which yields state-of-the-art classification performance on the YouTube and HMDB51 datasets.

Better Exploiting Motion for Better Action Recognition

It is established that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms.
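
The decomposition into dominant and residual motion can be sketched very crudely. The cited work fits a parametric affine model for the dominant (camera-induced) motion; approximating it by the mean flow vector, as below, is a deliberate simplification for illustration only.

```python
import numpy as np

def residual_motion(flow):
    """Split an optical-flow field into dominant and residual parts.
    flow: (H, W, 2) array of per-pixel displacement vectors.
    Here the dominant motion is approximated by the mean flow over
    the frame (an assumption; the cited method fits an affine model),
    and the residual, what the actors themselves do, is what remains."""
    dominant = flow.reshape(-1, 2).mean(axis=0)
    return flow - dominant

# A globally panning scene with one locally moving region:
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0          # camera pans right by 3 px everywhere
flow[1, 1] += [0.0, 2.0]    # one patch also moves downward
res = residual_motion(flow)
```

After subtraction, the residual field has zero mean, so descriptors computed on it respond to scene motion rather than camera motion.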

Modeling video evolution for action recognition

The proposed method captures video-wide temporal information for action recognition by postulating that a function capable of ordering the frames of a video temporally captures well the evolution of the appearance within the video.
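
A frame-ordering function in this spirit can be sketched with a least-squares regression; note this is a simplification, since the actual line of work uses a learning-to-rank objective rather than plain least squares.

```python
import numpy as np

def rank_pool(frames):
    """Video-evolution descriptor in the spirit of rank pooling: fit a
    linear function of the running-mean frame features that regresses
    the frame index, and use its parameter vector u as the video
    representation. u points along the direction in feature space in
    which appearance evolves over time.
    frames: (T, D) array of per-frame features."""
    T = frames.shape[0]
    # Running means smooth the per-frame features before fitting.
    v = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    u, *_ = np.linalg.lstsq(v, t, rcond=None)
    return u

# Appearance drifting steadily along the first feature axis: the
# resulting descriptor points along that direction of change.
frames = np.outer(np.arange(1.0, 9.0), np.array([1.0, 0.0]))
u = rank_pool(frames)
```

The appeal of this family of descriptors is that the video's temporal evolution is summarized in a single fixed-size vector, usable with any standard classifier.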

Action Recognition by Hierarchical Sequence Summarization

This work presents a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities and shows that its complexity grows sublinearly with the size of the hierarchy.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
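
The late-fusion step of a two-stream model can be sketched as follows. This is schematic: in the actual architecture each stream is a deep ConvNet producing class scores from RGB frames and stacked optical flow respectively; here those scores are simply given as inputs.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_stream_predict(spatial_logits, temporal_logits, w_temporal=1.0):
    """Late fusion of a two-stream model: the spatial stream scores
    classes from appearance, the temporal stream from motion, and the
    final prediction averages (optionally weights) their softmax
    scores before taking the argmax."""
    fused = softmax(spatial_logits) + w_temporal * softmax(temporal_logits)
    return int(np.argmax(fused))

# The appearance stream mildly prefers class 0, but the motion stream
# is strongly confident in class 1, so the fused prediction is class 1.
spatial = np.array([2.0, 1.0, 0.0])
temporal = np.array([0.0, 5.0, 0.0])
```

Fusing at the score level (rather than concatenating features) lets each stream be trained independently, which is one reason this architecture worked well with limited video training data.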