Unsupervised Video Understanding by Reconciliation of Posture Similarities

@article{Milbich2017UnsupervisedVU,
  title={Unsupervised Video Understanding by Reconciliation of Posture Similarities},
  author={Timo Milbich and Miguel {\'A}ngel Bautista and Ekaterina Sutter and Bj{\"o}rn Ommer},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={4404-4414}
}
Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents, the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. Therefore, we propose a completely unsupervised deep learning procedure based… 
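
The abstract above is truncated; its core idea is learning a fine-grained posture representation without labels by exploiting similarity relations between frames. As a hedged illustration of that general idea only (not the paper's actual procedure), the following sketch trains a small, hypothetical CNN encoder with a triplet loss over frame crops whose similarity is decided by some unsupervised heuristic:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical posture encoder: maps a person crop to an L2-normalised
# embedding. Purely illustrative, not the architecture from the paper.
class PostureEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.head(h), dim=1)

def posture_triplet_loss(encoder, anchor, positive, negative, margin=0.2):
    """Pull similar postures together, push dissimilar ones apart.
    anchor/positive/negative are frame batches chosen by an unsupervised
    similarity heuristic (e.g. nearest neighbours in an initial feature
    space); no pose labels are involved."""
    za, zp, zn = encoder(anchor), encoder(positive), encoder(negative)
    d_pos = (za - zp).pow(2).sum(1)
    d_neg = (za - zn).pow(2).sum(1)
    return F.relu(d_pos - d_neg + margin).mean()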

Citations

Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

TLDR
This work proposes a self-supervised learning method to jointly reason about spatial and temporal context for video recognition, and a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory cost.
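
The TLDR does not spell out the permutation strategy, so as an illustrative sketch only: jigsaw-style pretext tasks commonly shrink the permutation space by greedily selecting a subset of permutations with large pairwise Hamming distance, which keeps the classification targets distinguishable. All parameters below are arbitrary choices:

import random

def hamming(p, q):
    # number of positions where two permutations differ
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_tiles=9, n_perms=100, pool_size=1000, seed=0):
    """Greedily pick permutations that are far apart in Hamming distance.
    Sampling a random pool avoids enumerating all n_tiles! permutations."""
    rng = random.Random(seed)
    identity = tuple(range(n_tiles))
    pool = {tuple(rng.sample(identity, n_tiles)) for _ in range(pool_size)}
    pool.discard(identity)
    chosen = [identity]
    while len(chosen) < n_perms and pool:
        # candidate whose minimum distance to the chosen set is largest
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
        pool.discard(best)
    return chosen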

Unsupervised Learning of Action Classes With Continuous Temporal Embedding

TLDR
This work uses a continuous temporal embedding of framewise features to benefit from the sequential nature of activities and identifies clusters of temporal segments across all videos that correspond to semantically meaningful action classes.

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

TLDR
The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive proxy tasks.
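
As a minimal sketch of what explicitly enforcing temporal coherency can look like (a simplification with an assumed hinge-loss form, not the paper's exact objective), one can require frames that are close in time to lie closer in the embedding space than frames that are far apart:

import torch
import torch.nn.functional as F

def temporal_coherence_loss(embeddings, times, pos_window=2, margin=0.5):
    """Hinge loss making embeddings of temporally close frames
    (0 < |t_i - t_j| <= pos_window) more similar than distant ones.
    embeddings: (N, D) frame embeddings from one video; times: (N,)."""
    dist = torch.cdist(embeddings, embeddings)        # pairwise distances
    dt = (times[None, :] - times[:, None]).abs()
    pos = ((dt > 0) & (dt <= pos_window)).float()
    neg = (dt > pos_window).float()
    # per anchor: mean positive distance should undercut mean negative
    d_pos = (dist * pos).sum(1) / pos.sum(1).clamp(min=1)
    d_neg = (dist * neg).sum(1) / neg.sum(1).clamp(min=1)
    return F.relu(d_pos - d_neg + margin).mean()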

Refining the Pose: Training and Use of Deep Recurrent Autoencoders for Improving Human Pose Estimation

In this paper, a discriminative human pose estimation system based on deep learning is proposed for monocular video sequences. Our approach combines a simple but efficient Convolutional Neural Network …

Behavior-Driven Synthesis of Human Dynamics

TLDR
This work proposes a conditional variational framework which explicitly disentangles posture from behavior and is able to change the behavior of a person depicted in an arbitrary posture, or even to directly transfer behavior observed in a given video sequence.

An Unsupervised Framework for Online Spatiotemporal Detection of Activities of Daily Living by Hierarchical Activity Models

TLDR
The experimental data on a variety of monitoring scenarios in hospital settings reveals how this framework can be exploited to provide timely diagnoses and medical interventions for cognitive disorders, such as Alzheimer’s disease.

Understanding Object Dynamics for Interactive Image-to-Video Synthesis

TLDR
This generative model learns to infer natural object dynamics in response to user interaction, captures the interrelations between different object body regions, and can transfer dynamics onto novel, unseen object instances.

Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning

TLDR
This work proposes a sampling policy that adapts to the state of the network being trained: new permutations are sampled according to their expected utility for updating the convolutional feature representation.

Sharing Matters for Generalization in Deep Metric Learning

TLDR
Experiments show that, independent of the underlying network architecture and the specific ranking loss, the approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets.

PADS: Policy-Adapted Sampling for Visual Similarity Learning

TLDR
This work employs reinforcement learning to have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity; the adaptive sampling strategy significantly outperforms fixed sampling strategies.
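
A toy sketch of that idea, with an assumed discretisation into negative-distance bins and an assumed scalar reward (the paper's actual state, action, and reward definitions are not reproduced here): a learnable categorical distribution decides which bin to draw negatives from, and is updated by REINFORCE according to how much the learner improved:

import torch

class SamplingPolicy:
    """Illustrative adaptive sampling policy: a categorical distribution
    over distance bins for negative mining, trained with REINFORCE."""
    def __init__(self, n_bins=10, lr=0.1):
        self.logits = torch.zeros(n_bins, requires_grad=True)
        self.opt = torch.optim.Adam([self.logits], lr=lr)

    def sample_bin(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        bin_idx = dist.sample()
        return int(bin_idx), dist.log_prob(bin_idx)

    def update(self, log_probs, rewards):
        # reinforce bins whose sampled negatives improved the learner
        rewards = torch.tensor(rewards, dtype=torch.float32)
        advantage = rewards - rewards.mean()
        loss = -(torch.stack(log_probs) * advantage).sum()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()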

References

SHOWING 1-10 OF 55 REFERENCES

Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification

TLDR
A framework for modeling motion that exploits the temporal structure of human activities, representing them as temporal compositions of motion segments; the algorithm is shown to perform better than other state-of-the-art methods.

From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

TLDR
A novel approach for analyzing human actions in non-scripted, unconstrained video settings is presented, based on volumetric (x-y-t) patch classifiers termed actemes; it shows significant improvement over state-of-the-art low-level features while providing spatiotemporal localization as additional output, which sheds further light on detailed action understanding.

Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification

TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
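
The pretext task can be sketched as a data-sampling step: draw a frame triplet from a video and label it by whether the frames appear in a valid temporal order. The gap size and the swap rule below are illustrative simplifications of the paper's tuple sampling:

import random
import torch

def order_verification_sample(frames, gap=5):
    """One training example for temporal order verification.
    frames: (T, C, H, W) video tensor with T > 2 * gap.
    Returns a stacked triplet and a label: 1 = correct order,
    0 = first two frames swapped (an out-of-order negative)."""
    T = frames.shape[0]
    a = random.randrange(0, T - 2 * gap)
    b, c = a + gap, a + 2 * gap
    if random.random() < 0.5:
        return torch.stack([frames[a], frames[b], frames[c]]), 1
    return torch.stack([frames[b], frames[a], frames[c]]), 0

A binary classifier on top of per-frame CNN features then learns to verify the order, which forces the features to encode temporally varying cues such as pose.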

LSTM Self-Supervision for Detailed Behavior Analysis

TLDR
The generality of the approach is demonstrated by successfully applying it to self-supervised learning of human posture on two standard benchmark datasets.

Learning Latent Constituents for Recognition of Group Activities in Video

TLDR
To automatically learn activity constituents that are meaningful for the collective activity, max-margin multiple instance learning is employed to jointly remove clutter from groups and focus only on the relevant samples, learn the activity constituents, and train the multi-class activity classifier.

Action Recognition by Hierarchical Mid-Level Action Elements

TLDR
This work introduces an unsupervised method that is capable of distinguishing action-related segments from background segments and representing actions at multiple spatiotemporal resolutions, and develops structured models that capture a rich set of spatial, temporal and hierarchical relations among the segments.

Learning realistic human actions from movies

TLDR
A new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids, and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

Unsupervised Visual Representation Learning by Context Prediction

TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
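
The pretext task here is predicting where one image patch sits relative to another (one of 8 neighbour positions). A minimal sampling sketch, with patch size and gap chosen arbitrarily:

import random

def sample_context_pair(image, patch=32, gap=8):
    """Sample a centre patch and one of its 8 neighbours from an image
    tensor (C, H, W); the label 0-7 is the neighbour's position, which
    a small CNN is then trained to predict from the two patches."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    step = patch + gap
    _, H, W = image.shape
    cy = random.randrange(step, H - step - patch)
    cx = random.randrange(step, W - step - patch)
    label = random.randrange(8)
    dy, dx = offsets[label]
    ny, nx = cy + dy * step, cx + dx * step
    centre = image[:, cy:cy + patch, cx:cx + patch]
    neighbour = image[:, ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label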

Action bank: A high-level representation of activity in video

TLDR
Inspired by the recent object bank approach to image representation, Action Bank is presented, a new high-level representation of video comprised of many individual action detectors sampled broadly in semantic space as well as viewpoint space that is capable of highly discriminative performance.

Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching

Fengjun Lv, R. Nevatia · 2007 IEEE Conference on Computer Vision and Pattern Recognition
TLDR
Each action is modeled as a series of synthetic 2D human poses rendered from a wide range of viewpoints, and the constraints on transitions between the synthetic poses are represented by a graph model called Action Net.
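
Viterbi path searching over such a pose-transition graph is standard dynamic programming; a generic sketch (illustrative of decoding over an Action-Net-style graph, not the paper's exact model):

import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely pose path given frame observations.
    log_emit:  (T, S) log-likelihood of each frame under each pose
    log_trans: (S, S) log-probability of pose i -> pose j transitions
    log_init:  (S,)   log-probability of starting poses
    Returns the best pose index per frame."""
    T, S = log_emit.shape
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans   # (prev, cur) scores
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
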
...