Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

Linjiang Huang, Liang Wang, Hongsheng Li
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
As a challenging task of high-level video understanding, weakly supervised temporal action localization has been attracting increasing attention. With only video-level annotations, most existing methods handle this task with a localization-by-classification framework, which generally adopts a selector to choose snippets with high action probabilities, i.e., the foreground. Nevertheless, existing foreground selection strategies have a major limitation of only considering the…
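The localization-by-classification framework the abstract refers to can be sketched in a few lines: per-snippet class scores are pooled (here via top-k averaging, one common choice) into a video-level prediction trained against the video label, and foreground snippets are those scoring highly for the video's ground-truth classes. This is a minimal illustrative sketch, not the paper's method; the function names, top-k pooling, and thresholding rule are assumptions.

```python
import numpy as np

def video_level_scores(snippet_scores: np.ndarray, k: int) -> np.ndarray:
    """Aggregate per-snippet class scores (T x C) into video-level
    scores (C,) by averaging each class's top-k snippets (MIL-style pooling)."""
    topk = np.sort(snippet_scores, axis=0)[-k:]   # (k, C): highest k per class
    return topk.mean(axis=0)

def select_foreground(snippet_scores: np.ndarray, label_ids: list,
                      threshold: float) -> np.ndarray:
    """Mark a snippet as foreground if its score for any ground-truth
    class exceeds the threshold (a simple illustrative selector)."""
    relevant = snippet_scores[:, label_ids]       # (T, |labels|)
    return relevant.max(axis=1) > threshold       # (T,) boolean mask
```

In practice the snippet scores come from a trained classifier over pre-extracted features, and the resulting foreground mask is post-processed into temporal proposals.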


Dual-Evidential Learning for Weakly-supervised Temporal Action Localization

A generalized evidential deep learning framework for WS-TAL, called Dual-Evidential Learning for Uncertainty modeling (DELU), which extends the traditional paradigm of EDL to adapt to the weakly-supervised multi-label classification goal and achieves state-of-the-art performance on THUMOS14 and ActivityNet1.2 benchmarks.

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

A novel method from a category exclusion perspective, named Progressive Complementary Learning (ProCL), which gradually enhances the snippet-level supervision and introduces the background-aware pseudo complementary labeling in order to exclude more categories for snippets of less ambiguity.

Weakly-supervised Action Localization via Hierarchical Mining

A hierarchical mining strategy at the video level and snippet level, i.e., hierarchical supervision and hierarchical consistency mining, is proposed to maximize the use of the given annotations and prediction-wise consistency.

ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization

ASM-Loc is proposed, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods, with segment-centric components including dynamic segment sampling to compensate the contribution of short actions, and intra- and inter-segment attention to model action dynamics and capture temporal dependencies.

Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization

An adversarial learning strategy is presented to break the limitation of mining pseudo background snippets, and a novel temporal enhancement network is designed to help the model construct temporal relations among affinity snippets based on the proposed strategy, further improving action localization performance.

Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation

This method seeks to mine the representative snippets in each video for propagating information between video snippets to generate better pseudo labels, and obtains superior performance on the THUMOS14 and ActivityNet benchmarks.

Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization

This work proposes a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting.

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

A novel distillation-collaboration framework with two branches acting as CBP and VLP respectively, which are effectively fused to promote a strong alliance for weakly-supervised temporal action localization.

Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization

A novel C3BN is proposed to achieve robust snippet predictions, together with a macro-micro consistency regularization strategy that forces the model to be invariant (or equivariant) to transformations of snippets with respect to video semantics, snippet predictions and snippet features.

End-to-End Temporal Action Detection With Transformer

TadTR is an end-to-end Transformer-based method for temporal action detection that achieves state-of-the-art performance on THUMOS14 and HACS Segments, and requires lower computation cost than previous detectors, while preserving remarkable performance.

A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization

This paper presents a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address weakly supervised temporal action localization.

Action Completeness Modeling with Background Aware Networks for Weakly-Supervised Temporal Action Localization

A novel weakly-supervised Action Completeness Modeling framework with Background Aware Networks (ACM-BANets) is proposed, using an asymmetrical training strategy to suppress both highly discriminative and ambiguous background frames and remove false positives.

Background Suppression Network for Weakly-supervised Temporal Action Localization

Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage, while the only hint is video-level labels: whether each…

Weakly-Supervised Action Localization by Generative Attention Modeling

This paper proposes to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE), and demonstrates the advantage of the method and its effectiveness in handling the action-context confusion problem.

Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

A Two-Stream Consensus Network (TSCN) is proposed to address weakly-supervised temporal action localization, together with a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries.

Modeling Sub-Actions for Weakly Supervised Temporal Action Localization

This paper describes a novel approach that alleviates the contradiction and detects more complete action instances by explicitly modeling sub-actions, and devises three complementary loss functions, namely representation loss, balance loss and relation loss, to ensure the learned sub-actions are diverse and have clear semantic meanings.

Weakly-supervised Temporal Action Localization by Uncertainty Modeling

A new perspective on background frames is presented where they are modeled as out-of-distribution samples regarding their inconsistency, and a background entropy loss is introduced to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes.

3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization

This work proposes a framework, called 3C-Net, which only requires video-level supervision (weak supervision) in the form of action category labels and the corresponding count to learn discriminative action features with enhanced localization capabilities.

Weakly-Supervised Action Localization With Background Modeling

A latent approach that learns to detect actions in long sequences given training videos with only whole-video class labels, and can be used to aggressively scale up learning to in-the-wild, uncurated Instagram videos (where relevant frames and videos are automatically selected through attentional processing).

Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

This work explicitly models the key-instance assignment as a hidden variable and adopts an Expectation-Maximization (EM) framework, deriving two pseudo-label generation schemes to model the E and M steps and iteratively optimize the likelihood lower bound.