Corpus ID: 232478851

Composable Augmentation Encoding for Video Representation Learning

Chen Sun, Arsha Nagrani, Yonglong Tian, C. Schmid
We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (fine…
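The positive/negative pairing described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss, which is a common choice for contrastive learning but is not named in the text above; the function names, the `temperature` value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE-style loss over a batch (illustrative assumption).

    view_a[i] and view_b[i] are embeddings of two views of instance i
    (the positive pair); view_b[j] for j != i serve as negatives.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two instances, two views each.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(aligned, aligned)     # positives agree
loss_mismatched = info_nce(aligned, swapped)  # positives disagree
```

In this toy setup the loss is lower when the two views of each instance map to the same embedding than when they are swapped, which is the invariance to view selection that the abstract says such methods implicitly assume.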

Memory-augmented Dense Predictive Coding for Video Representation Learning
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular of representations for action recognition, trained with a predictive attention mechanism over a set of compressed memories.
Video Representation Learning by Dense Predictive Coding
With a single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.
Can Temporal Information Help with Contrastive Self-Supervised Learning?
This work presents Temporal-aware Contrastive self-supervised learning (TaCo), a general paradigm to enhance video CSL that selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding.
Contrastive Bidirectional Transformer for Temporal Representation Learning
This paper adopts the stacked transformer architecture but generalizes its training objective to maximize the mutual information between the masked signals and the bidirectional context via a contrastive loss, which enables the model to handle continuous signals such as visual features.
Video Representation Learning by Recognizing Temporal Transformations
This work promotes accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions, using the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping.
Evolving Losses for Unsupervised Video Representation Learning
An unsupervised representation evaluation metric is proposed using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, which produces similar results to weakly-supervised, task-specific ones.
The Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expansive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual…
Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
This work demonstrates that approaches like MOCO and PIRL learn occlusion-invariant representations but fail to capture viewpoint and category instance invariance, which are crucial components for object recognition, and proposes an approach to leverage unstructured videos to learn representations that possess higher viewpoint invariance.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Time-Contrastive Networks: Self-Supervised Learning from Video
A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed, and it is demonstrated that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.