Probabilistic Representations for Video Contrastive Learning

@article{Park2022ProbabilisticRF,
  title={Probabilistic Representations for Video Contrastive Learning},
  author={Jungin Park and Jiyoung Lee and Ig-Jae Kim and Kwanghoon Sohn},
  journal={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  pages={14691-14701}
}
  • Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn
  • Published 8 April 2022
  • Computer Science
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic representation. We hypothesize that the clips composing a video follow different distributions over short durations, but can represent the complicated and sophisticated distribution of the whole video when combined in a common embedding space. Thus, the proposed method represents video clips as normal distributions and combines them into…
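
A minimal sketch of the core idea, probabilistic embeddings trained contrastively: a head predicts a Gaussian (mean and log-variance) per clip, samples from it with the reparameterization trick, and applies InfoNCE over the samples. Module and function names (`ProbabilisticClipHead`, `info_nce`) and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Gaussian clip embeddings + contrastive loss over samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticClipHead(nn.Module):
    """Maps a clip feature to the mean and log-variance of a Gaussian."""
    def __init__(self, feat_dim=512, embed_dim=128):
        super().__init__()
        self.mean_head = nn.Linear(feat_dim, embed_dim)
        self.logvar_head = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        mu = self.mean_head(feats)
        logvar = self.logvar_head(feats)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between sampled embeddings of two clips per video."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: feats1/feats2 are backbone features of two clips from each video.
head = ProbabilisticClipHead()
feats1, feats2 = torch.randn(8, 512), torch.randn(8, 512)
z1, mu1, logvar1 = head(feats1)
z2, mu2, logvar2 = head(feats2)
loss = info_nce(z1, z2)
```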

Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning

This work proposes a novel formulation of contrastive learning that uses semantic similarity between instances, called Similarity Contrastive Estimation (SCE), and shows that SCE reaches state-of-the-art results for pretraining video representations and that the learned representations generalize to video downstream tasks.
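
A hedged sketch of the soft-target idea behind SCE: the contrastive target mixes the usual one-hot instance label with an inter-instance similarity distribution. The mixing weight and temperatures below are assumed values, not the paper's.

```python
# Sketch only: soft contrastive targets mixing one-hot labels with
# a similarity distribution computed from a second view.
import torch
import torch.nn.functional as F

def sce_loss(q, k, lambda_=0.5, tau=0.1, tau_t=0.07):
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau                       # student similarities
    with torch.no_grad():
        sim = k @ k.t() / tau_t
        sim.fill_diagonal_(float('-inf'))          # no self-similarity
        sim = F.softmax(sim, dim=1)                # inter-instance relations
        target = lambda_ * torch.eye(q.size(0)) + (1 - lambda_) * sim
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

loss = sce_loss(torch.randn(8, 128), torch.randn(8, 128))
```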

UATVR: Uncertainty-Adaptive Text-Video Retrieval

This paper proposes an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each lookup as a distribution matching procedure, and adds additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for high-level reasoning.
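
A rough sketch of retrieval as distribution matching in the spirit of UATVR: text and video are each embedded as a Gaussian, and similarity is a Monte Carlo estimate of the expected cosine similarity over samples. The sample count and Gaussian parameterization are assumptions, not the paper's design.

```python
# Sketch only: similarity between two Gaussian embeddings via sampling.
import torch
import torch.nn.functional as F

def expected_similarity(mu_t, logvar_t, mu_v, logvar_v, n_samples=7):
    sims = []
    for _ in range(n_samples):
        # Reparameterized samples from the text and video distributions.
        zt = mu_t + torch.randn_like(mu_t) * torch.exp(0.5 * logvar_t)
        zv = mu_v + torch.randn_like(mu_v) * torch.exp(0.5 * logvar_v)
        sims.append(F.cosine_similarity(zt, zv, dim=-1))
    return torch.stack(sims).mean(dim=0)   # Monte Carlo estimate

sim = expected_similarity(torch.randn(4, 256), torch.randn(4, 256) * 0.1,
                          torch.randn(4, 256), torch.randn(4, 256) * 0.1)
```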

Dual-path Adaptation from Image to Video Transformers

In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.

Boosting Semi-Supervised Semantic Segmentation with Probabilistic Representations

A Probabilistic Representation Contrastive Learning (PRCL) framework is proposed that improves representation quality by taking the probability of each representation into consideration, and can tune the contribution of ambiguous representations to tolerate the risk of inaccurate pseudo-labels.
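
A hedged sketch of probability-aware contrastive weighting in the spirit of PRCL: each representation carries a predicted variance, and high-variance (ambiguous) representations are down-weighted in the loss. The inverse-variance weighting below is an illustrative choice, not the paper's exact scheme.

```python
# Sketch only: down-weighting ambiguous representations in InfoNCE.
import torch
import torch.nn.functional as F

def weighted_info_nce(mu, logvar, pos, temperature=0.1):
    z = F.normalize(mu, dim=1)
    pos = F.normalize(pos, dim=1)
    logits = z @ pos.t() / temperature
    targets = torch.arange(z.size(0))
    per_sample = F.cross_entropy(logits, targets, reduction='none')
    # Larger predicted variance -> smaller contribution to the loss,
    # tolerating the risk of inaccurate pseudo-labels.
    weights = 1.0 / (logvar.exp().mean(dim=1) + 1.0)
    return (weights * per_sample).sum() / weights.sum()

loss = weighted_info_nce(torch.randn(8, 128), torch.randn(8, 128) * 0.1,
                         torch.randn(8, 128))
```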

Representing Spatial Trajectories as Distributions

A representation learning framework for spatial trajectories that can accurately predict the past and future of a trajectory segment, as well as the interpolation between two different segments, outperforming autoregressive baselines and obtaining samples from a trajectory at any continuous point in time.

References


Spatiotemporal Contrastive Video Representation Learning

This work proposes a temporally consistent spatial augmentation method that imposes strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method that avoids overly enforcing invariance on clips that are distant in time.
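
A minimal sketch of a temporally consistent spatial augmentation: crop and flip parameters are sampled once per clip and reused for every frame, so the spatial augmentation is strong but identical across time. This is a generic illustration, not the paper's exact pipeline.

```python
# Sketch only: one set of augmentation parameters shared by all frames.
import torch

def consistent_crop_flip(clip, crop=112):
    # clip: (T, C, H, W) tensor of frames
    t, c, h, w = clip.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    flip = torch.rand(1).item() < 0.5
    out = clip[:, :, top:top + crop, left:left + crop]  # same crop per frame
    if flip:
        out = torch.flip(out, dims=[3])                 # same flip per frame
    return out

clip = torch.randn(16, 3, 128, 171)
aug = consistent_crop_flip(clip)   # (16, 3, 112, 112)
```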

Time-Equivariant Contrastive Video Representation Learning

  • S. Jenni, Hailin Jin
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
A novel self-supervised contrastive learning method that learns representations from unlabelled videos which are equivariant to temporal transformations and better capture video dynamics, achieving state-of-the-art results on video retrieval and action recognition benchmarks.
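
A hedged sketch of a time-equivariance objective: two clips of the same video are given different temporal transformations (playback speeds here), and a small head must recover which transformation pair relates them from the two embeddings. The speed set and head design are assumptions, not the paper's architecture.

```python
# Sketch only: predict the temporal transformation pair from embeddings.
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4]  # assumed set of playback speeds

class EquivarianceHead(nn.Module):
    def __init__(self, dim=128, n_classes=len(SPEEDS) ** 2):
        super().__init__()
        self.fc = nn.Linear(2 * dim, n_classes)

    def forward(self, z1, z2):
        # Classify the (speed1, speed2) pair from concatenated embeddings.
        return self.fc(torch.cat([z1, z2], dim=1))

head = EquivarianceHead()
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, len(SPEEDS) ** 2, (8,))
loss = nn.functional.cross_entropy(head(z1, z2), labels)
```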

TCLR: Temporal Contrastive Learning for Video Representation

VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

This paper improves the temporal feature representations of MoCo from two perspectives: empowering the temporal robustness of the encoder and modeling the temporal decay of the keys.
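
A rough sketch of temporally decayed keys in a MoCo-style queue: older negatives receive exponentially smaller weights, modeling the fact that they come from an out-of-date encoder. The decay form and rate are assumptions, not the paper's exact formulation.

```python
# Sketch only: age-dependent weighting of negatives in a memory queue.
import torch
import torch.nn.functional as F

def decayed_contrastive_logits(q, k_pos, queue, ages, tau=0.07, decay=0.99):
    q = F.normalize(q, dim=1)
    l_pos = (q * F.normalize(k_pos, dim=1)).sum(dim=1, keepdim=True)
    l_neg = q @ F.normalize(queue, dim=1).t()
    l_neg = l_neg * (decay ** ages)     # older keys contribute less
    return torch.cat([l_pos, l_neg], dim=1) / tau

q, k_pos = torch.randn(8, 128), torch.randn(8, 128)
queue, ages = torch.randn(1024, 128), torch.arange(1024.0)
logits = decayed_contrastive_logits(q, k_pos, queue, ages)
loss = F.cross_entropy(logits, torch.zeros(8, dtype=torch.long))
```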

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation learning from a new perspective, video pace prediction, and introduces contrastive learning to push the model towards discriminating different paces while maximizing agreement on similar video content.
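
A minimal sketch of pace prediction as a pretext task: clips are sampled from a video at different frame strides ("paces"), and a classifier on top of the backbone predicts which pace produced each clip. The stride set and head are illustrative assumptions.

```python
# Sketch only: sample clips at varying paces and classify the pace.
import torch
import torch.nn as nn

PACES = [1, 2, 4, 8]  # assumed frame-sampling strides

def sample_clip(video, pace, clip_len=16):
    # video: (T, C, H, W); take every `pace`-th frame from a random start.
    max_start = video.shape[0] - pace * clip_len
    start = torch.randint(0, max_start + 1, (1,)).item()
    return video[start:start + pace * clip_len:pace]

video = torch.randn(256, 3, 112, 112)
pace_idx = torch.randint(0, len(PACES), (1,)).item()
clip = sample_clip(video, PACES[pace_idx])      # (16, 3, 112, 112)

pace_head = nn.Linear(128, len(PACES))          # on top of a video backbone
feat = torch.randn(1, 128)                      # placeholder clip feature
loss = nn.functional.cross_entropy(pace_head(feat), torch.tensor([pace_idx]))
```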

Active Contrastive Learning of Audio-Visual Video Representations

An active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, improving the quality of negative samples and performance on tasks where there is high mutual information in the data, e.g., video classification.

Contrast and Order Representations for Video Self-supervised Learning

A contrast-and-order representation (CORP) framework for learning self-supervised video representations that automatically captures both the appearance information within each frame and the temporal information across frames, together with a novel decoupling attention method that learns symmetric similarity (contrast) and anti-symmetric patterns (order).
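
A speculative sketch of decoupled symmetric/anti-symmetric relation modeling: a learned bilinear form is split into a symmetric part (similarity, i.e., contrast) and an anti-symmetric part (directional order). This construction illustrates the decoupling idea only; it is not CORP's actual attention design.

```python
# Sketch only: split a bilinear relation into symmetric and
# anti-symmetric components.
import torch
import torch.nn as nn

class DecoupledRelation(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, z1, z2):
        sym = 0.5 * (self.W + self.W.t())       # symmetric: similarity
        anti = 0.5 * (self.W - self.W.t())      # anti-symmetric: order
        contrast = (z1 @ sym * z2).sum(dim=1)   # z1^T S z2 = z2^T S z1
        order = (z1 @ anti * z2).sum(dim=1)     # z1^T A z2 = -z2^T A z1
        return contrast, order

rel = DecoupledRelation()
contrast, order = rel(torch.randn(8, 128), torch.randn(8, 128))
```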

Composable Augmentation Encoding for Video Representation Learning

It is shown that representations learned by the proposed 'augmentation aware' contrastive learning framework encode valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.

Motion-Focused Contrastive Learning of Video Representations*

A Motion-focused Contrastive Learning method that capitalizes on the optical flow of each frame in a video to temporally and spatially sample tubelets as data augmentations, and aligns gradient maps of the convolutional layers with optical flow maps from spatial, temporal, and spatio-temporal perspectives in order to ground motion information in feature learning.

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

This work proposes to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to motion information, the impact of the scene is weakened, and the temporal sensitivity of the network is further enhanced.
...