Semi-Supervised Action Recognition with Temporal Contrastive Learning

Ankit Singh, Omprakash Chakraborty, Ashutosh Varshney, Rameswar Panda, Rogério Schmidt Feris, Kate Saenko, Abir Das. CVPR 2021.
Learning to recognize actions from only a handful of labeled videos is a challenging problem due to the scarcity of tediously collected activity labels. We approach this problem by learning a two-pathway temporal contrastive model using unlabeled videos at two different speeds, leveraging the fact that changing video speed does not change an action. Specifically, we propose to maximize the similarity between encoded representations of the same video at two different speeds as well as minimize…
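As a toy illustration of the two-pathway idea (the function names, the frame-striding "speed change," and the mean-pooling "encoder" below are all illustrative stand-ins, not the paper's architecture), playing a clip at a different speed can be simulated by subsampling frames, and an InfoNCE-style loss pulls the two views of the same clip together while pushing other clips away:

```python
import numpy as np

def change_speed(video, stride):
    """Simulate a faster playback speed by striding over frames."""
    return video[::stride]

def encode(video):
    """Toy stand-in for a video encoder: average-pool frames, L2-normalise."""
    z = video.mean(axis=0)
    return z / np.linalg.norm(z)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: high similarity to the other-speed view of the
    same clip, low similarity to other clips."""
    pos = np.exp(anchor @ positive / temperature)
    neg = np.sum([np.exp(anchor @ n / temperature) for n in negatives])
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
# Two synthetic "videos": 16 frames of 8-dim features around distinct means.
vid_a = rng.normal(size=8) + 0.01 * rng.normal(size=(16, 8))
vid_b = rng.normal(size=8) + 0.01 * rng.normal(size=(16, 8))

z_a, z_a_fast = encode(vid_a), encode(change_speed(vid_a, 2))
z_b = encode(vid_b)
# The same action at two speeds stays close; a different video does not.
loss = info_nce(z_a, z_a_fast, negatives=[z_b])
```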

How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

This work uses semi-supervised learning with multiple adverb pseudo-labels to leverage videos that carry only action labels, and gathers adverb annotations for three existing video retrieval datasets, enabling the new tasks of recognizing adverbs in unseen action-adverb compositions and in unseen domains.

SVFormer: Semi-supervised Video Transformer for Action Recognition

This work introduces Tube TokenMix, a novel augmentation strategy tailored for video data in which two clips are mixed via a mask whose masked tokens are consistent over the temporal axis, and further proposes a temporal warping augmentation to cover the complex temporal variation in videos.
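A minimal sketch of the tube-mixing idea (shapes, names, and the mask ratio are illustrative): sample one spatial token mask and share it across every frame, so each mixed "tube" spans the whole clip:

```python
import numpy as np

def tube_token_mix(clip_a, clip_b, mask_ratio=0.5, rng=None):
    """Mix two token grids of shape (T, H, W, C) using a single spatial mask
    shared across the temporal axis: masked tokens come from clip_b."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w, _ = clip_a.shape
    mask = rng.random((h, w)) < mask_ratio      # no time dimension on purpose
    return np.where(mask[None, :, :, None], clip_b, clip_a)

clip_a = np.zeros((8, 4, 4, 3))                 # 8 frames of 4x4 tokens
clip_b = np.ones((8, 4, 4, 3))
mixed = tube_token_mix(clip_a, clip_b, rng=np.random.default_rng(1))
```

Because the mask has no time axis, every spatial location takes its tokens from the same source clip for the entire duration, which is the temporal consistency the summary describes.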

Going Deeper into Recognizing Actions in Dark Environments: A Comprehensive Benchmark Study

The UG2+ Challenge Track 2 (UG2-2) in IEEE CVPR 2021 is launched with the goal of evaluating and advancing the robustness of AR models in dark environments, guiding models to tackle the task in both fully and semi-supervised manners.

Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition

This work proposes a more effective pseudo-labeling scheme, Cross-Model Pseudo-Labeling (CMPL), which introduces a lightweight auxiliary network alongside the primary backbone and asks the two models to predict pseudo-labels for each other, observing that they tend to learn complementary representations from the same video clips.
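The cross-supervision step can be sketched as follows (the softmax-and-threshold recipe here is a generic FixMatch-style choice, not necessarily the paper's exact formulation):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_labels(logits, threshold=0.8):
    """Hard labels plus a confidence mask derived from one model's predictions."""
    probs = softmax(logits)
    return probs.argmax(axis=-1), probs.max(axis=-1) >= threshold

# Each model labels the unlabeled clips for the *other* model.
logits_primary = np.array([[4.0, 0.0, 0.0], [0.4, 0.3, 0.3]])
logits_auxiliary = np.array([[0.0, 5.0, 0.0], [3.0, 0.0, 0.0]])

labels_for_primary, mask_for_primary = pseudo_labels(logits_auxiliary)
labels_for_auxiliary, mask_for_auxiliary = pseudo_labels(logits_primary)
```

Only confident predictions (mask True) would contribute to the other model's unsupervised loss; the second clip's primary-model prediction is too uncertain and is dropped.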

Iterative Contrast-Classify For Semi-supervised Temporal Action Segmentation

This work proposes a novel way to learn frame-wise representations from temporal convolutional networks (TCNs) by clustering input features with added time-proximity conditions and multi-resolution similarity, merging representation learning with conventional supervised learning.

Learning from Temporal Gradient for Semi-supervised Action Recognition

This paper introduces temporal gradient as an additional modality for more attentive feature extraction in semi-supervised video action recognition and explicitly distills the fine-grained motion representations from temporal gradient and imposes consistency across different modalities.
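The extra modality itself is cheap to compute: a temporal gradient is just the frame-to-frame difference, a rough flow-like motion signal. The consistency term below is a simplified stand-in for the paper's cross-modal objective, not its exact loss:

```python
import numpy as np

def temporal_gradient(video):
    """Frame-to-frame difference of a clip (T, H, W, C) -> (T-1, H, W, C)."""
    return np.diff(video.astype(np.float64), axis=0)

def consistency_loss(features_rgb, features_tg):
    """Simplified cross-modal consistency: mean squared distance between
    features extracted from RGB and from the temporal gradient."""
    return float(np.mean((features_rgb - features_tg) ** 2))

static_clip = np.ones((4, 2, 2, 3))   # a clip with no motion at all
tg = temporal_gradient(static_clip)
```

A static clip yields an all-zero gradient, which is why this modality highlights motion rather than appearance.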

Iterative Frame-Level Representation Learning And Classification For Semi-Supervised Temporal Action Segmentation

This work proposes a novel way to learn frame-wise representations from temporal convolutional networks (TCNs) by clustering input features with added time-proximity conditions and multi-resolution similarity, merging representation learning with conventional supervised learning.

SeqMatchNet: Contrastive Learning with Sequence Matching for Place Recognition & Relocalization

This work bridges the gap between single image representation learning and sequence matching through SeqMatchNet which transforms the single image descriptors such that they become more responsive to the sequence matching metric.

Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets

This paper proposes a new and challenging Few-Shot Visual Question Generation (FS-VQG) task and provides a comprehensive benchmark for it, concluding that trivially extending existing VQG approaches with transfer learning or meta-learning may not be enough to tackle the inherent challenges of few-shot VQG.

An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

This work addresses the challenge of training multi-label action recognition models from only single positive training labels by proposing two approaches that are based on generating pseudo training examples sampled from similar instances within the train set.

Spatiotemporal Contrastive Video Representation Learning

This work proposes a temporally consistent spatial augmentation method that imposes strong spatial augmentations on each frame of a video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method that avoids overly enforcing invariance on clips that are distant in time.
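A minimal sketch of temporally consistent cropping (the crop size and helper name are illustrative): sample the augmentation parameters once, then apply them to every frame:

```python
import numpy as np

def consistent_random_crop(video, size, rng):
    """Sample ONE crop window and apply it to all frames, so the spatial
    augmentation is strong but identical across the temporal axis."""
    _, h, w = video.shape[:3]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return video[:, y:y + size, x:x + size]

rng = np.random.default_rng(0)
# All frames share one spatial pattern, so a consistent crop keeps them identical.
frame = np.arange(64.0).reshape(8, 8)
video = np.stack([frame] * 5)
crop = consistent_random_crop(video, size=4, rng=rng)
```

Sampling a fresh window per frame would instead destroy temporal correspondence, which is exactly what this augmentation is designed to avoid.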

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction -- by introducing contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content.

Watching the World Go By: Representation Learning from Unlabeled Videos

Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single-image representations, demonstrating improvements over recent unsupervised single-image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.

Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones

This work presents a combination of video and instance-based adaptation methods, paired with either a classifier or an embedding-based framework to transfer the knowledge from source to target and shows that the proposed adaptation approach substantially improves the performance on these challenging and practical tasks.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
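The two streams are trained separately and combined by fusing their class scores; a minimal late-fusion sketch by averaging softmax outputs (the paper also explores SVM-based fusion, which is not shown here):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(spatial_logits, temporal_logits):
    """Late fusion: average the softmax scores of the appearance (RGB)
    stream and the motion (optical-flow) stream."""
    return (softmax(spatial_logits) + softmax(temporal_logits)) / 2

spatial = np.array([[2.0, 1.0, 0.0]])   # appearance mildly favours class 0
temporal = np.array([[0.0, 3.0, 0.0]])  # motion strongly favours class 1
fused = fuse_streams(spatial, temporal)
```

Here the confident motion stream overrides the weaker appearance evidence, illustrating why the flow stream often drives the final prediction.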

Video Representation Learning by Dense Predictive Coding

With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Video Representation Learning with Visual Tempo Consistency

This work demonstrates that visual tempo can also serve as a self-supervision signal for video representation learning, and proposes to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL).

Time-Contrastive Networks: Self-Supervised Learning from Video

A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed, and it is demonstrated that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.