Motion-aware Contrastive Video Representation Learning via Foreground-background Merging

  • Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang
  • Published 30 September 2021
  • Computer Science
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
In light of the success of contrastive learning in the image domain, current self-supervised video representation learning methods usually employ a contrastive loss to facilitate video representation learning. However, when naively pulling two augmented views of a video closer, the model tends to learn the common static background as a shortcut and fails to capture the motion information, a phenomenon dubbed background bias. Such bias makes the model suffer from weak generalization ability… 
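The title names foreground-background merging as the core augmentation. A minimal sketch of that idea (the function name, tensor shapes, and the hand-made mask below are assumptions for illustration; the paper obtains its foreground masks without supervision) composites the foreground of one clip onto the background of another, so that only motion remains a reliable cue:

```python
import numpy as np

def foreground_background_merge(fg_clip, bg_clip, fg_mask):
    """Blend the foreground of one clip onto the background of another.

    fg_clip, bg_clip: float arrays of shape (T, H, W, C) in [0, 1].
    fg_mask: float array of shape (T, H, W, 1), 1.0 at foreground pixels.
    """
    return fg_mask * fg_clip + (1.0 - fg_mask) * bg_clip

# Toy example: a 2-frame, 4x4 clip whose centre is treated as foreground.
fg = np.ones((2, 4, 4, 3))       # stand-in "foreground" clip
bg = np.zeros((2, 4, 4, 3))      # stand-in "background" clip
mask = np.zeros((2, 4, 4, 1))
mask[:, 1:3, 1:3, :] = 1.0       # hypothetical foreground mask
merged = foreground_background_merge(fg, bg, mask)
```

Pulling the original clip and such a merged view together encourages the encoder to rely on the (shared) foreground motion rather than the (swapped) background.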

Dual Contrastive Learning for Spatio-temporal Representation

A novel dual contrastive formulation is presented that learns effective spatio-temporal representations and obtains state-of-the-art or comparable performance on UCF-101, HMDB-51, and Diving-48 datasets.

PreViTS: Contrastive Pretraining with Video Tracking Supervision

This work proposes PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects.

Evaluating and Mitigating Static Bias of Action Representations in the Background and the Foreground

A simple yet effective video data augmentation technique, StillMix, that automatically identifies bias-inducing video frames and improves TSM and the Video Swin Transformer by more than 10% accuracy on SCUB for OOD action recognition.

Frequency Selective Augmentation for Video Representation Learning

Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated… 

Static and Dynamic Concepts for Self-supervised Video Representation Learning

A novel learning scheme to first learn general visual concepts then attend to discriminative local areas for video understanding, which utilizes static frame and frame difference to help decouple static and dynamic concepts, and respectively align the concept distributions in latent space.

iQuery: Instruments as Queries for Audio-Visual Sound Separation

“visually named” queries are utilized to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms, and an additional query is inserted as an audio prompt while freezing the attention mechanism.

Face-to-Face Contrastive Learning for Social Intelligence Question-Answering

Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions using factorization nodes to contextualize the multimodal face-to-face interaction along the boundaries of the speaking turn is proposed.

HiSA: Hierarchically Semantic Associating for Video Temporal Grounding

Hierarchically Semantic Associating is proposed, which aims to precisely align the video with language and obtain discriminative representation for further location regression, and significantly outperforms the state-of-the-art VTG methods.



Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning

Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks by getting supervision from the data itself, but some current methods tend to cheat from the background, so this work proposes to remove the background impact by adding the background.

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

This work proposes to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information, the impact of the scene is weakened, and the temporal sensitivity of the network is further enhanced.

Motion-Focused Contrastive Learning of Video Representations

A Motion-focused Contrastive Learning method that capitalizes on optical flow of each frame in a video to temporally and spatially sample the tubelets as data augmentations and aligns gradient maps of the convolutional layers to optical flow maps from spatial, temporal and spatio-temporal perspectives, in order to ground motion information in feature learning.

Self-supervised Video Representation Learning by Context and Motion Decoupling

This work develops a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task that improves the quality of the learned video representation and finds the motion prediction to be a strong regularization for video networks.

Spatiotemporal Contrastive Video Representation Learning

This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames, and proposes a sampling-based temporal augmentation methods to avoid overly enforcing invariance on clips that are distant in time.
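The temporally consistent spatial augmentation described above amounts to sampling one set of random augmentation parameters per clip and reusing it for every frame, instead of re-sampling per frame. A minimal sketch with a random crop (function name and shapes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def consistent_random_crop(clip, size):
    """Apply the SAME random crop to every frame of a (T, H, W, C) clip.

    Sampling the crop window once per clip keeps spatial content aligned
    across frames, preserving temporal consistency of the augmented view.
    """
    t, h, w, c = clip.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return clip[:, top:top + size, left:left + size, :]

clip = rng.random((8, 16, 16, 3))      # 8-frame toy clip
cropped = consistent_random_crop(clip, 8)
```

Per-frame re-sampling of the crop would instead introduce spurious jitter that the encoder could mistake for motion.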

MoDist: Motion Distillation for Self-supervised Video Representation Learning

It is shown that the representations learned with the MoDist method focus more on foreground motion regions and thus generalize better to downstream tasks, and can be as effective as (in some cases even better than) representations learned with full supervision.

Self-supervised Co-training for Video Representation Learning

This paper investigates the benefit of adding semantic-class positives to instance-based InfoNCE (Info Noise-Contrastive Estimation) training, and proposes a novel self-supervised co-training scheme to improve the popular InfoNCE loss.
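The InfoNCE loss mentioned here is, for a single query, a softmax cross-entropy over similarities to a bank of keys, with the augmented view of the same instance as the target. A minimal numpy sketch (the function name, temperature value, and toy data are assumptions for illustration):

```python
import numpy as np

def info_nce(query, keys, positive_idx, temperature=0.07):
    """InfoNCE loss for one query against a bank of keys.

    query: (D,) feature vector; keys: (N, D) key bank;
    positive_idx: index of the positive key (same instance, other view).
    """
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (k @ q) / temperature        # (N,) cosine similarities
    logits = logits - logits.max()        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])   # cross-entropy with the positive

rng = np.random.default_rng(0)
q = rng.normal(size=128)
keys = rng.normal(size=(16, 128))
keys[3] = q                               # key 3 plays the positive view
loss = info_nce(q, keys, positive_idx=3)
```

Adding semantic-class positives, as the paper proposes, widens the target set beyond this single instance-level positive.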

Cycle-Contrast for Self-Supervised Video Representation Learning

It is demonstrated that the video representation learned by CCL can be transferred well to downstream tasks of video understanding, outperforming previous methods in nearest neighbour retrieval and action recognition tasks on UCF101, HMDB51 and MMAct.

VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

This paper improves the temporal feature representations of MoCo from two perspectives, both based on contrastive learning: empowering the temporal robustness of the encoder and modeling the temporal decay of the keys.

Broaden Your Views for Self-Supervised Video Learning

It is demonstrated that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.