Learning from Temporal Gradient for Semi-supervised Action Recognition

@article{Xiao2022LearningFT,
  title={Learning from Temporal Gradient for Semi-supervised Action Recognition},
  author={Junfei Xiao and Longlong Jing and Lin Zhang and Ju He and Qi She and Zongwei Zhou and Alan Loddon Yuille and Yingwei Li},
  journal={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Semi-supervised video action recognition aims to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly adapted from image-based methods (e.g., FixMatch). Without specifically exploiting the temporal dynamics and inherent multimodal attributes of video, their results can be suboptimal. To better leverage the encoded temporal information in videos, we introduce temporal gradient as an additional modality for…
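The temporal-gradient modality the abstract refers to is, at its core, the frame-to-frame difference of the RGB clip. A minimal sketch (the function name and array layout are my own assumptions, not the paper's code):

```python
import numpy as np

def temporal_gradient(clip: np.ndarray) -> np.ndarray:
    """Frame-to-frame difference of a video clip (illustrative sketch).

    clip: array of shape (T, H, W, C) holding T RGB frames.
    Returns an array of shape (T - 1, H, W, C) whose entry t equals
    clip[t + 1] - clip[t], emphasizing motion while suppressing
    static appearance.
    """
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]
```

A static clip yields an all-zero gradient, which is why this signal highlights motion cues that plain RGB frames entangle with appearance.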


SVFormer: Semi-supervised Video Transformer for Action Recognition

Tube TokenMix, a novel augmentation strategy tailored for video data, is introduced: video clips are mixed via a mask whose masked tokens are consistent along the temporal axis. A temporal warping augmentation is also proposed to cover the complex temporal variation in videos.

MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module

An efficient network (dubbed MEST) is proposed to extract spatio-temporal information at relatively low computational load; experiments demonstrate its effectiveness in terms of accuracy, computational cost, and network scale.

Z-Domain Entropy Adaptable Flex for Semi-supervised Action Recognition in the Dark

  • Zhi Chen
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022
The proposed Z-DEAF method addresses unclear classification boundaries between categories by alternating its Expanding Entropy and Shrinking Entropy steps, and achieves state-of-the-art results on ARID.

Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework

This paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for human activity recognition, attaining an execution-time improvement of up to 167× in terms of frames per second.

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

This work empirically explores the low data regime for video classification and discovers that transformers perform extremely well in the low-labeled video setting compared to CNNs, and recommends that semi-supervised learning video work should consider the use of video transformers in the future.

Sequential Order-Aware Coding-Based Robust Subspace Clustering for Human Action Recognition in Untrimmed Videos

A sequential order-aware coding-based robust subspace clustering (SOAC-RSC) scheme for human action recognition achieves state-of-the-art performance on the Keck Gesture and Weizmann datasets, and provides competitive performance on the other 6 public datasets, such as UCF101 and URADL, for the HAR task.

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation

This work revisits the weak-to-strong consistency framework, popularized by FixMatch from semi-supervised classification, and presents a dual-stream perturbation technique, enabling two strong views to be simultaneously guided by a common weak view.
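The weak-to-strong consistency idea borrowed from FixMatch works by pseudo-labelling each unlabeled sample from its weakly augmented view, keeping only confident predictions, and training the strongly augmented view (or, in the dual-stream variant above, both strong views) to match them. A hedged illustration of the confidence-masking step (function name and the 0.95 threshold are assumptions, not either paper's code):

```python
import numpy as np

def confident_pseudo_labels(weak_probs: np.ndarray, threshold: float = 0.95):
    """FixMatch-style pseudo-labelling from the weak view (sketch).

    weak_probs: (N, K) class probabilities predicted on weak augmentations.
    Returns (labels, mask): hard pseudo-labels and a boolean mask selecting
    samples confident enough to supervise the strong view(s).
    """
    labels = weak_probs.argmax(axis=1)
    mask = weak_probs.max(axis=1) >= threshold
    return labels, mask
```

The unsupervised loss is then the cross-entropy between the strong-view predictions and `labels`, averaged only over entries where `mask` is true.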

Optimized Deep-Learning-Based Method for Cattle Udder Traits Classification

The proposed optimized deep-learning models for automatic analysis of cattle udder conformation traits outperform the reference methods recently introduced in the literature and improve the performance of the DL models by approximately 10%.



Semi-Supervised Action Recognition with Temporal Contrastive Learning

This work proposes a two-pathway temporal contrastive model using unlabeled videos at two different speeds, leveraging the fact that changing video speed does not change an action: it maximizes the similarity between encoded representations of the same video at two different speeds and minimizes the similarity between different videos played at different speeds.
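The two-speed idea above rests on generating two temporally subsampled views of the same clip; playback speed changes the sampling stride but not the action label. A rough sketch of the view generation (function name and stride values are illustrative assumptions):

```python
import numpy as np

def speed_views(clip: np.ndarray, slow_stride: int = 1, fast_stride: int = 2):
    """Return two views of the same clip played at different speeds.

    clip: array of shape (T, H, W, C). The fast view keeps every
    `fast_stride`-th frame, simulating faster playback; a contrastive
    loss would then pull the encodings of the two views together.
    """
    return clip[::slow_stride], clip[::fast_stride]
```

Because both views come from the same underlying video, they form a positive pair for the contrastive objective, while views from different videos serve as negatives.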

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Self-supervised Co-training for Video Representation Learning

This paper investigates the benefit of adding semantic-class positives to instance-based InfoNCE (noise contrastive estimation) training, and proposes a novel self-supervised co-training scheme to improve the popular InfoNCE loss.

Spatiotemporal Contrastive Video Representation Learning

This work proposes a temporally consistent spatial augmentation method that imposes strong spatial augmentations on each frame of a video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method that avoids overly enforcing invariance on clips that are distant in time.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.

Recognize Actions by Disentangling Components of Dynamics

A new ConvNet architecture for video representation learning is proposed, which can derive disentangled components of dynamics purely from raw video frames, without the need of optical flow estimation.

Learning Representational Invariances for Data-Efficient Action Recognition

This paper investigates various data augmentation strategies that capture different video invariances, including photometric, geometric, temporal, and actor/scene augmentations, and adopts a state-of-the-art consistency-based semi-supervised learning framework to validate the effectiveness of the explored strong data augmentation strategies.

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.

Self-supervised Video Representation Learning by Context and Motion Decoupling

This work develops a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task that improves the quality of the learned video representation and finds the motion prediction to be a strong regularization for video networks.