UntrimmedNets for Weakly Supervised Action Recognition and Detection

Limin Wang, Yuanjun Xiong, Dahua Lin, Luc Van Gool
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Published 9 March 2017
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the… 
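The soft-selection idea described in the abstract — pooling per-clip class scores into a single video-level prediction so that only video-level labels are needed for training — can be illustrated with a minimal NumPy sketch. This is an illustration of the general attention-pooling mechanism, not the paper's exact implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_prediction(clip_scores, selection_scores):
    """Combine per-clip class scores into one video-level prediction.

    clip_scores:      (num_clips, num_classes) raw class logits per clip
    selection_scores: (num_clips,) scalar importance logit per clip
    """
    attn = softmax(selection_scores)             # soft selection weights over clips
    class_probs = softmax(clip_scores, axis=-1)  # per-clip class probabilities
    return attn @ class_probs                    # (num_classes,) video-level probs
```

Because the video-level output is differentiable in both modules, a classification loss on the video label can train the clip classifier and the selection weights jointly.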

Weakly-Supervised Action Recognition and Localization via Knowledge Transfer

A novel weakly-supervised action recognition framework, KTUntrimmedNet, that uses only video-level annotations for untrimmed videos and transfers information from publicly available trimmed videos to assist model learning.

Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision

A novel weakly supervised framework to simultaneously locate action frames as well as recognize actions in untrimmed videos, which takes advantage of the self-attention mechanism to weight each frame, such that the influence of background frames can be effectively eliminated.

Action Recognition From Single Timestamp Supervision in Untrimmed Videos

This work proposes a method for untrimmed videos that is supervised by single timestamps located around each action instance; it replaces expensive action bounds with sampling distributions initialised from these timestamps, and demonstrates that these distributions converge to the location and extent of discriminative action segments.

TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

A novel weakly supervised framework to recognize actions and locate the corresponding frames in untrimmed videos simultaneously and takes advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames can be effectively eliminated.

ActionBytes: Learning From Trimmed Videos to Localize Actions

ActionBytes is shown to be advantageous for zero-shot localization as well as for traditional weakly supervised localization methods that train on long videos, achieving state-of-the-art results.

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

This paper presents a spatio-temporal action recognition model that is trained with only video-level labels, which are significantly easier to annotate, and reports the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24.

AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos

A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
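The Outer-Inner-Contrastive idea — scoring a candidate segment by contrasting average activations just outside its boundaries against those inside — can be sketched in NumPy. This is a simplified illustration assuming per-frame class activations and a fixed inflation ratio; the paper's exact formulation of the boundary areas may differ.

```python
import numpy as np

def oic_loss(activations, start, end, inflation=0.25):
    """Outer-Inner-Contrastive loss for one predicted segment (a sketch).

    activations: (T,) per-frame class activation scores
    start, end:  inner boundaries of the predicted segment (frame indices)
    inflation:   fraction of the segment length used for the outer area
    """
    length = end - start
    pad = max(1, int(round(inflation * length)))
    o_start = max(0, start - pad)
    o_end = min(len(activations), end + pad)
    inner = activations[start:end].mean()
    outer_frames = np.concatenate(
        [activations[o_start:start], activations[end:o_end]])
    outer = outer_frames.mean() if outer_frames.size else 0.0
    return outer - inner  # low when the segment tightly covers high activations
```

Minimizing this loss pushes predicted boundaries toward segments whose interior activations are high relative to their immediate surroundings, which is the segment-level supervision the TLDR refers to.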

Live Video Action Recognition from Unsupervised Action Proposals

This work introduces a live video action detection application which integrates the action classifier step with an unsupervised and online temporal action proposal (TAP) generator, and evaluates, for the first time, the precision of this novel pipeline for the problem of action detection in untrimmed videos.

Deep Learning-Based Action Detection in Untrimmed Videos: A Survey

This article provides an extensive overview of deep learning-based algorithms that tackle temporal action detection in untrimmed videos at different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised.

Weakly Supervised Temporal Action Localization Using Deep Metric Learning

  • Ashraful Islam, R. Radke
  • Computer Science
  • 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2020
This work proposes a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training, comprising a classification module that generates action labels for each segment of the video and a deep metric learning module that learns the similarity between different action instances.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
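The two-stream combination can be sketched as late fusion: each stream produces a class-probability vector, and the final prediction is their weighted average. The 1.5 weight favoring the optical-flow stream below is an illustrative assumption, not necessarily the paper's fusion setting (the original work also explored SVM-based fusion).

```python
import numpy as np

def fuse_streams(spatial_probs, temporal_probs, w_temporal=1.5):
    """Late fusion of spatial and temporal stream predictions.

    spatial_probs, temporal_probs: (num_classes,) probability vectors
    w_temporal: relative weight of the optical-flow stream (illustrative)
    """
    fused = spatial_probs + w_temporal * temporal_probs
    return fused / fused.sum()  # renormalize to a probability vector
```

Averaging the streams' outputs, rather than fusing features, lets each network be trained independently on its own modality.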

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

A novel loss function for the localization network is proposed to explicitly consider temporal overlap and achieve high temporal localization accuracy in untrimmed long videos.

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.

Actionness Estimation Using Hybrid Fully Convolutional Networks

A new deep architecture for actionness estimation is presented, called hybrid fully convolutional network (HFCN), which is composed of an appearance FCN (A-FCN) and a motion FCN (M-FCN); these leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion.

Automatic annotation of human actions in video

This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data.

Temporal Action Localization with Pyramid of Score Distribution Features

A Pyramid of Score Distribution Feature (PSDF) is proposed to capture the motion information at multiple resolutions centered at each detection window, which mitigates the influence of unknown action position and duration, and shows significant performance gain over previous detection approaches.

Weakly supervised learning of actions from transcripts

Temporal Action Detection Using a Statistical Language Model

This work proposes a novel method for temporal action detection including statistical length and language modeling to represent temporal and contextual structure and reports state-of-the-art results on three datasets.

3D Convolutional Neural Networks for Human Action Recognition

A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.