Action Unit Memory Network for Weakly Supervised Temporal Action Localization

Authors: Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, Yongdong Zhang
Venue: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training. However, without frame-level annotations, it is challenging to achieve localization completeness and relieve background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which can mitigate the above two challenges by learning an action unit memory bank. In the proposed… 
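The abstract describes learning a memory bank of action-unit templates. As a generic illustration only (not the paper's actual architecture), such a bank can be read via attention over its templates; the function name, shapes, and scaling below are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(features, memory):
    """Attend over a bank of action-unit templates (hypothetical sketch).

    features: (T, D) snippet features for one video
    memory:   (K, D) learned action-unit templates
    returns:  (T, D) features reconstructed from the bank,
              and the (T, K) attention weights
    """
    attn = softmax(features @ memory.T / np.sqrt(memory.shape[1]))
    read = attn @ memory
    return read, attn

# toy usage: 5 snippets, 8-dim features, bank of 3 action units
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
bank = rng.standard_normal((3, 8))
read, attn = memory_read(feats, bank)
print(read.shape, attn.shape)  # (5, 8) (5, 3)
```

In such designs the attention weights indicate which action units a snippet resembles, which is one way a model could separate action snippets from background without frame-level labels.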

Multi-Scale Structure-Aware Network for Weakly Supervised Temporal Action Detection
This is the first work to fully explore the global and local structure information in a unified deep model for weakly supervised action detection, and extensive experimental results demonstrate that the proposed MSA-Net performs favorably against state-of-the-art methods.
Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization
This paper proposes a novel framework in which dense pseudo-labels are generated to provide completeness guidance for the model, and demonstrates the superiority of the method over existing state-of-the-art methods on four benchmarks: THUMOS'14, GTEA, BEOID, and ActivityNet.
Robust Pedestrian Attribute Recognition Using Group Sparsity for Occlusion Videos
This paper formulates the search for non-occluded frames as sparsity-based temporal attention over a crowded video, so that the model is guided not to attend to occluded frames, addressing the uncorrelated attention issue.
Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization
A Two-Stream Consensus Network (TSCN) is proposed to address weakly-supervised temporal action localization, together with a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries.
Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization
This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
Background Suppression Network for Weakly-supervised Temporal Action Localization
Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage, while the only hint is video-level labels: whether each video contains action frames of interest.
Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks
Zi-yi Liu, Le Wang, +4 authors G. Hua. 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
The Contrast-based Localization EvaluAtioN Network (CleanNet) is proposed with the new action proposal evaluator, which provides pseudo-supervision by leveraging the temporal contrast in snippet-level action classification predictions, and is an integral part of CleanNet which enables end-to-end training.
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
This work proposes a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks that attains state-of-the-art results on the THUMOS'14 dataset and outstanding performance on ActivityNet 1.3 even with its weak supervision.
AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos
A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
Learning Temporal Co-Attention Models for Unsupervised Video Action Localization
This work proposes a two-step "clustering + localization" iterative procedure, which can be regarded as a direct extension of the weakly-supervised ACL model, and introduces new losses specially designed for ACL, including an action-background separation loss and a cluster-based triplet loss.
Weakly-Supervised Action Localization by Generative Attention Modeling
This paper proposes to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE), and demonstrates the advantage of the method and its effectiveness in handling the action-context confusion problem.
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
A novel loss function for the localization network is proposed to explicitly consider temporal overlap and achieve high temporal localization accuracy in untrimmed long videos.
Segregated Temporal Assembly Recurrent Networks for Weakly Supervised Multiple Action Detection
This paper proposes a segregated temporal assembly recurrent (STAR) network for weakly-supervised multiple action detection and designs a score term called segregated temporal gradient-weighted class activation mapping (ST-GradCAM) fused with attention weights.