Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling

@article{Richard2017WeaklySA,
  title={Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling},
  author={Alexander Richard and Hilde Kuehne and Juergen Gall},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={1273-1282}
}
We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand-labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to… 
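
To make the fine-to-coarse idea in the abstract concrete, the sketch below combines per-frame subaction log-probabilities, which in the paper come from a recurrent network, with the ordered action transcript through a Viterbi-style dynamic program that may only stay in the current action or advance to the next one. The names viterbi_align and frame_log_probs are illustrative assumptions, and the coarse probabilistic model is reduced here to a plain left-to-right alignment without the paper's length modeling.

# Minimal sketch (not the authors' code): align frames to an ordered transcript
# by Viterbi decoding over "stay or advance" transitions.
import numpy as np

def viterbi_align(frame_log_probs, transcript):
    """frame_log_probs: (T, C) log-probabilities over C classes per frame.
    transcript: ordered list of K class indices known to occur in the video.
    Returns a length-T array assigning each frame to a transcript position."""
    T, _ = frame_log_probs.shape
    K = len(transcript)
    emis = frame_log_probs[:, transcript]                  # (T, K) score of each frame under each position
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = emis[0, 0]                               # must start in the first action
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k]
            advance = score[t - 1, k - 1] if k > 0 else -np.inf
            back[t, k] = k if stay >= advance else k - 1
            score[t, k] = max(stay, advance) + emis[t, k]
    path = np.empty(T, dtype=int)
    path[-1] = K - 1                                       # must end in the last action
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy usage: 6 frames, 3 classes, transcript "class 0 then class 2".
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(viterbi_align(log_probs, [0, 2]))                    # a monotone assignment such as [0 0 1 1 1 1]

In the paper's weakly supervised setting, such an alignment yields pseudo frame labels on which the frame-wise classifier is re-trained, and alignment and training are iterated.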

Citations

A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation
TLDR
This work proposes a hierarchical approach to the problem of weakly supervised learning of human actions from ordered action labels by structuring recognition in a coarse-to-fine manner, and shows competitive performance on weak learning tasks such as temporal action segmentation and action alignment.
Weakly Supervised Energy-Based Learning for Action Segmentation
TLDR
A new constrained discriminative forward loss (CDFL) is used to train the HMM and GRU under weak supervision and gives results superior to the state of the art on the Breakfast Action, Hollywood Extended, and 50Salads benchmarks (a simplified forward-score sketch appears after this list).
SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation
TLDR
This work assumes that for each training video only the list of actions occurring in the video is given, but not when, how often, or in which order they occur, and proposes an approach that can be trained end-to-end on such data.
Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning
TLDR
This work designs an architecture consisting of a Union-of-Subspaces Network, an ensemble of autoencoders in which each autoencoder models a low-dimensional action subspace and captures variations of an action within and across videos.
Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment
  • Li Ding, Chenliang Xu
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
A novel action modeling framework is proposed, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion.
Modeling Sub-Actions for Weakly Supervised Temporal Action Localization
TLDR
This paper describes a novel approach to alleviate the contradiction and thereby detect more complete action instances by explicitly modeling sub-actions, and devises three complementary loss functions, namely representation loss, balance loss, and relation loss, to ensure that the learned sub-actions are diverse and have clear semantic meanings.
Weakly Supervised Temporal Action Localization Using Deep Metric Learning
  • Ashraful Islam, R. Radke
  • Computer Science
    2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2020
TLDR
This work proposes a weakly supervised temporal action localization method that requires only video-level action instances as supervision during training; it combines a classification module that generates action labels for each segment in the video with a deep metric learning module that learns the similarity between different action instances.
...
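
Several of the follow-up works listed above, most directly the energy-based CDFL approach, replace a hard Viterbi alignment with a differentiable forward score that sums, in log-space, over every segmentation consistent with the ordered transcript. The sketch below is a minimal version of that idea under the same assumed frame_log_probs/transcript interface as the earlier sketch; it is not the exact CDFL loss, which additionally discriminates valid from invalid segmentations.

# Minimal sketch (not the papers' loss): log-space forward score over all
# monotone alignments of the frames to the ordered transcript.
import numpy as np

def log_forward_score(frame_log_probs, transcript):
    T, _ = frame_log_probs.shape
    K = len(transcript)
    emis = frame_log_probs[:, transcript]                   # (T, K)
    alpha = np.full(K, -np.inf)
    alpha[0] = emis[0, 0]                                   # start in the first action
    for t in range(1, T):
        advanced = np.concatenate(([-np.inf], alpha[:-1]))  # paths moving from position k-1 to k
        alpha = np.logaddexp(alpha, advanced) + emis[t]     # stay or advance, then emit frame t
    return float(alpha[-1])                                 # end in the last action

Because every operation here is differentiable, maximizing this score with respect to the frame-wise network provides a training signal without frame-level labels; a hard decoder such as the Viterbi sketch above then recovers the actual segmentation at test time.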

References

SHOWING 1-10 OF 36 REFERENCES
Weakly supervised learning of actions from transcripts
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
TLDR
The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.
Watch-n-patch: Unsupervised understanding of actions and relations
TLDR
The model learns the high-level action co-occurrence and temporal relations between the actions in the activity video and is applied to unsupervised action segmentation and recognition, and also to a novel application that detects forgotten actions, which is called action patching.
Automatic annotation of human actions in video
TLDR
This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data.
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data (a minimal sketch of this design follows the reference list).
Temporal Action Detection Using a Statistical Language Model
TLDR
This work proposes a novel method for temporal action detection including statistical length and language modeling to represent temporal and contextual structure and reports state-of-the-art results on three datasets.
Learning realistic human actions from movies
TLDR
A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
Weakly Supervised Action Labeling in Videos under Ordering Constraints
TLDR
It is shown that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner and evaluated on a new and challenging dataset of 937 video clips.
End-to-End Learning of Action Detection from Frame Glimpses in Videos
TLDR
A fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions and uses REINFORCE to learn the agent's decision policy.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
...
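
The two-stream architecture summarized in the reference above is concrete enough to sketch: one ConvNet sees a single RGB frame, a second sees a stack of optical-flow fields, and their class scores are fused late. The tiny backbone and layer sizes below are assumptions standing in for the original AlexNet/VGG-style streams, so this is an illustration of the design rather than a reimplementation of the paper.

# Minimal sketch (assumed layer sizes): two-stream network with late score fusion.
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        def stream(in_channels):
            # A small CNN standing in for the paper's AlexNet/VGG-style streams.
            return nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(32, num_classes),
            )
        self.spatial = stream(3)                  # a single RGB frame
        self.temporal = stream(2 * flow_stack)    # stacked x/y optical-flow fields

    def forward(self, rgb, flow):
        # Late fusion by averaging the two streams' class scores.
        return (self.spatial(rgb).softmax(-1) + self.temporal(flow).softmax(-1)) / 2

# Toy usage with random tensors in place of real video frames and flow.
net = TwoStreamNet(num_classes=5)
scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
print(scores.shape)                               # torch.Size([2, 5])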