Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

@article{Hong2021CrossmodalCN,
  title={Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization},
  author={Fa-Ting Hong and Jia-Chang Feng and Dan Xu and Ying Shan and Wei-Shi Zheng},
  journal={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that the features extracted from the pre-trained extractors, e.g., I3D, which are trained for trimmed video action classification, but not…
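
The two fusion baselines the abstract mentions are straightforward to express in code. Below is a minimal sketch, not taken from the paper, of early feature concatenation versus late score-level fusion of two-stream (appearance and motion) snippet features; the tensor shapes, classifier heads, and the top-k pooling ratio are illustrative assumptions.

import torch
import torch.nn as nn

T, D, C = 200, 1024, 20          # snippets, per-stream feature dim, action classes
rgb  = torch.randn(T, D)         # appearance features from a pre-trained encoder (e.g., I3D RGB stream)
flow = torch.randn(T, D)         # motion features from a pre-trained encoder (e.g., I3D flow stream)

# (a) Feature concatenation: fuse the two streams before classification.
concat_head = nn.Linear(2 * D, C)
cas_concat = concat_head(torch.cat([rgb, flow], dim=-1))      # [T, C] class activation sequence

# (b) Score-level fusion: classify each stream separately, then average the scores.
rgb_head, flow_head = nn.Linear(D, C), nn.Linear(D, C)
cas_score = 0.5 * (rgb_head(rgb) + flow_head(flow))           # [T, C]

# Video-level prediction via top-k temporal pooling over the CAS (a common WS-TAL recipe).
k = max(1, T // 8)
video_logits = cas_concat.topk(k, dim=0).values.mean(dim=0)   # [C]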


References

Showing 1-10 of 54 references
Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization
TLDR: A Two-Stream Consensus Network (TSCN) that simultaneously addresses the challenges of weakly-supervised temporal action localization, together with a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries.
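
A hedged sketch of an attention-normalization-style loss in the spirit of this summary: the gap between the mean of the largest and smallest attention values is widened so the attention tends toward a binary selection. This is a simplification, not TSCN's exact formulation, and the ratio s is an assumed hyper-parameter.

import torch

def attention_norm_loss(att, s=8):
    # att: [T] attention values in [0, 1] for one video; s is an assumed ratio.
    l = max(att.numel() // s, 1)
    top = att.topk(l).values.mean()                     # mean of the l largest attention values
    bottom = att.topk(l, largest=False).values.mean()   # mean of the l smallest attention values
    return bottom - top                                 # minimized when attention is close to binary

att = torch.sigmoid(torch.randn(200))
loss = attention_norm_loss(att)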
Learning Temporal Co-Attention Models for Unsupervised Video Action Localization
TLDR: This work proposes a two-step "clustering + localization" iterative procedure, which can be regarded as a direct extension of the weakly-supervised ACL model, and introduces new losses specially designed for ACL, including an action-background separation loss and a cluster-based triplet loss.
ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization
TLDR: An Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization and introduces extended labels with auxiliary context categories to facilitate the learning of action-context separation.
3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization
TLDR: This work proposes a framework, called 3C-Net, which only requires video-level supervision (weak supervision) in the form of action category labels and the corresponding count to learn discriminative action features with enhanced localization capabilities.
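
Since the title points to a center loss, here is a minimal, hedged sketch of a standard center-loss term in that spirit; the per-class centers, shapes, and usage are illustrative assumptions rather than 3C-Net's exact formulation.

import torch

C, D = 20, 1024
centers = torch.zeros(C, D)        # per-class feature centers (learnable or updated in practice)

def center_loss(features, labels):
    # features: [N, D] pooled action features; labels: [N] class indices.
    return ((features - centers[labels]) ** 2).sum(dim=1).mean()

feats = torch.randn(8, D)
labs = torch.randint(0, C, (8,))
loss = center_loss(feats, labs)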
AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos
TLDR: A novel weakly-supervised TAL framework called AutoLoc is developed to directly predict the temporal boundary of each action instance, and a novel Outer-Inner-Contrastive (OIC) loss is proposed to automatically discover the needed segment-level supervision for training such a boundary predictor.
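
A hedged sketch of an OIC-style score, under the assumption (consistent with the summary above, but not a copy of AutoLoc's exact formulation) that a good segment has high class activation inside its boundaries and low activation in a surrounding outer margin; the inflation ratio is an illustrative choice.

import torch

def oic_loss(cas, start, end, inflate=0.25):
    # cas: [T] class activation sequence; [start, end) is a candidate segment.
    T, length = cas.numel(), end - start
    margin = max(int(round(inflate * length)), 1)
    outer_s, outer_e = max(start - margin, 0), min(end + margin, T)
    inner_mean = cas[start:end].mean()
    outer_ring = torch.cat([cas[outer_s:start], cas[end:outer_e]])
    outer_mean = outer_ring.mean() if outer_ring.numel() > 0 else cas.new_zeros(())
    return outer_mean - inner_mean   # lower is better: high activation inside, low outside

loss = oic_loss(torch.rand(100), start=30, end=50)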
Background Suppression Network for Weakly-supervised Temporal Action Localization
Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage, while the only hint is video-level labels: whether each…
Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization
TLDR: This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
Weakly-Supervised Action Localization by Generative Attention Modeling
TLDR: This paper proposes to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE), and demonstrates the advantage of the method and its effectiveness in handling the action-context confusion problem.
Weakly Supervised Temporal Action Localization Using Deep Metric Learning
  • Ashraful Islam, R. Radke
  • 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2020
TLDR: This work proposes a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training, with a classification module that generates action labels for each segment in the video and a deep metric learning module that learns the similarity between different action instances.
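
As a rough illustration of the metric-learning component described above, the sketch below uses a standard triplet margin loss over pooled action-instance features; the cosine distance, margin, and feature dimension are assumptions, not the paper's exact design.

import torch
import torch.nn.functional as F

def instance_triplet_loss(anchor, positive, negative, margin=0.5):
    # Each input: [D] pooled feature of one action instance; positive shares the
    # anchor's class, negative does not. Cosine distance and margin are assumptions.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=0)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=0)
    return F.relu(d_pos - d_neg + margin)

a, p, n = torch.randn(3, 1024).unbind(0)
loss = instance_triplet_loss(a, p, n)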
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
TLDR: A novel loss function for the localization network is proposed to explicitly consider temporal overlap and achieve high temporal localization accuracy in untrimmed long videos.