UntrimmedNets for Weakly Supervised Action Recognition and Detection
@article{Wang2017UntrimmedNetsFW,
  title={UntrimmedNets for Weakly Supervised Action Recognition and Detection},
  author={Limin Wang and Yuanjun Xiong and Dahua Lin and Luc Van Gool},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={6402--6411}
}
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the…
390 Citations
Weakly-Supervised Action Recognition and Localization via Knowledge Transfer
- Computer Science, PRCV
- 2019
KTUntrimmedNet, a novel weakly-supervised action recognition framework for untrimmed videos, uses only video-level annotations and transfers information from publicly available trimmed videos to assist model learning.
Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision
- Computer Science, AAAI
- 2019
A novel weakly supervised framework to simultaneously locate action frames as well as recognize actions in untrimmed videos, which takes advantage of the self-attention mechanism to weight each frame, such that the influence of background frames can be effectively eliminated.
Action Recognition From Single Timestamp Supervision in Untrimmed Videos
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This work proposes a method supervised by single timestamps located around each action instance in untrimmed videos; it replaces expensive action bounds with sampling distributions initialised from these timestamps and demonstrates that the distributions converge to the location and extent of discriminative action segments.
TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization
- Computer Science, Machine Intelligence Research
- 2022
A novel weakly supervised framework to recognize actions and locate the corresponding frames in untrimmed videos simultaneously and takes advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames can be effectively eliminated.
ActionBytes: Learning From Trimmed Videos to Localize Actions
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work shows the advantage of ActionBytes for zero-shot localization as well as for traditional weakly supervised localization, which trains on long videos, achieving state-of-the-art results.
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
- Computer Science, ECCV
- 2020
This paper presents a spatio-temporal action recognition model that is trained with only video-level labels, which are significantly easier to annotate, and reports the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24.
AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos
- Computer Science, ECCV
- 2018
A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
Live Video Action Recognition from Unsupervised Action Proposals
- Computer Science, 2021 17th International Conference on Machine Vision and Applications (MVA)
- 2021
This work introduces a live video action detection application which integrates the action classifier step with an unsupervised and online TAPs generator, and evaluates, for the first time, the precision of this novel pipeline for the problem of action detection in untrimmed videos.
Deep Learning-Based Action Detection in Untrimmed Videos: A Survey
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2023
This article provides an extensive overview of deep learning-based algorithms for temporal action detection in untrimmed videos under different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised.
Weakly Supervised Temporal Action Localization Using Deep Metric Learning
- Computer Science, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2020
This work proposes a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training, and proposes a classification module to generate action labels for each segment in the video, and a deep metric learning module to learn the similarity between different action instances.
References
SHOWING 1-10 OF 60 REFERENCES
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
- Computer Science, ECCV
- 2016
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.…
Two-Stream Convolutional Networks for Action Recognition in Videos
- Computer Science, NIPS
- 2014
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A novel loss function for the localization network is proposed to explicitly consider temporal overlap and achieve high temporal localization accuracy in untrimmed long videos.
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
- Computer Science, ECCV
- 2016
The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.
Actionness Estimation Using Hybrid Fully Convolutional Networks
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A new deep architecture for actionness estimation is presented, called hybrid fully convolutional network (H-FCN), which is composed of an appearance FCN (A-FCN) and a motion FCN (M-FCN); the two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion.
Automatic annotation of human actions in video
- Computer Science, 2009 IEEE 12th International Conference on Computer Vision
- 2009
This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data.
Temporal Action Localization with Pyramid of Score Distribution Features
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A Pyramid of Score Distribution Feature (PSDF) is proposed to capture the motion information at multiple resolutions centered at each detection window, which mitigates the influence of unknown action position and duration, and shows significant performance gain over previous detection approaches.
Weakly supervised learning of actions from transcripts
- Computer Science, Comput. Vis. Image Underst.
- 2017
Temporal Action Detection Using a Statistical Language Model
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This work proposes a novel method for temporal action detection including statistical length and language modeling to represent temporal and contextual structure and reports state-of-the-art results on three datasets.
3D Convolutional Neural Networks for Human Action Recognition
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2013
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.