Representation learning from videos in-the-wild: An object-centric approach

Rob Romijnders, Aravindh Mahendran, Michael Tschannen, Josip Djolonga, Marvin Ritter, Neil Houlsby, Mario Lucic. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors with self-supervised losses that arise naturally from the video-shot-frame-object hierarchy present in each video. We report competitive results on 19 transfer learning tasks of the Visual Task Adaptation Benchmark (VTAB) and on 8 out-of-distribution generalization tasks, and discuss the benefits and shortcomings of the proposed approach.
Generalization and Robustness Implications in Object-Centric Learning
This paper trains state-of-the-art unsupervised models on five common multi-object datasets, evaluates segmentation accuracy and downstream object-property prediction, and finds object-centric representations to be generally useful for downstream tasks and robust to shifts in the data distribution.
Object-aware Contrastive Learning for Debiased Scene Representation
A novel object-aware contrastive learning framework that first localizes objects in a self-supervised manner and then debiases scene correlations via data augmentations that respect the inferred object locations; the framework proves effective when models are trained on multi-object images or evaluated on images with shifted backgrounds (and distributions).
Object-Centric Representation Learning from Unlabeled Videos
This work introduces a novel object-centric approach to temporal coherence that encourages a deep convolutional network to learn similar representations for object-like regions segmented from nearby frames.
Watching the World Go By: Representation Learning from Unlabeled Videos
Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single-image representations that improve over recent unsupervised single-image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
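The noise-contrastive objective behind methods like this can be sketched as a standard InfoNCE loss: embeddings of two frames from the same video form a positive pair, and the other videos in the batch act as negatives. A minimal NumPy sketch under illustrative assumptions (embedding size, temperature, and function name are not from the paper):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch serve as negatives."""
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (its own positive).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))               # embeddings of 8 video frames
z2 = z1 + 0.01 * rng.normal(size=(8, 32))   # near-identical positive views
loss = info_nce_loss(z1, z2)
```

Matched positives should yield a much lower loss than mismatched ones, which is exactly the signal that drives the representation toward temporal invariance.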
Unsupervised Learning of Object Structure and Dynamics from Videos
A keypoint-based image representation is adopted and a stochastic dynamics model of the keypoints is learned that outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition, and reward prediction.
Unsupervised Visual Representation Learning by Context Prediction
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
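The within-image context task can be sketched as: crop a central patch and one of its eight neighbors, and train a classifier to predict which neighbor was chosen. A minimal sketch of the patch extraction and labeling (patch size, grid layout, and function name are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

# The 8 neighbor offsets (row, col); the index into this list is the
# classification label the network must predict.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def sample_context_pair(image, patch=32, rng=None):
    """Return (center_patch, neighbor_patch, label) for the 8-way
    relative-position pretext task. Requires image sides >= 3 * patch."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    # Choose a center so that all eight neighbors fit inside the image.
    cy = int(rng.integers(patch, h - 2 * patch + 1))
    cx = int(rng.integers(patch, w - 2 * patch + 1))
    label = int(rng.integers(8))
    dy, dx = OFFSETS[label]
    ny, nx = cy + dy * patch, cx + dx * patch
    center = image[cy:cy + patch, cx:cx + patch]
    neighbor = image[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label
```

A network fed the two patches then solves an 8-way classification over `label`, which forces it to learn object layout without any human annotation.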
Video Representation Learning by Dense Predictive Coding
With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pretrained on ImageNet.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a convolutional neural network, and shows that this method captures information that varies over time, such as human pose.
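The temporal-order-verification task behind this method can be sketched as sampling frame triplets and asking a binary classifier whether they are in a valid temporal order: an evenly spaced triplet (a, b, c) is a positive, and a shuffled version is a negative. A sketch of the tuple sampling (gap size and function name are illustrative assumptions):

```python
import random

def sample_order_tuples(num_frames, n_tuples=4, gap=5, seed=0):
    """Return (triplet, label) pairs: label 1 for a correctly ordered
    frame-index triplet, 0 for one with its last two frames swapped."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_tuples):
        a = rng.randrange(num_frames - 2 * gap)
        b, c = a + gap, a + 2 * gap
        out.append(((a, b, c), 1))   # valid temporal order -> positive
        out.append(((a, c, b), 0))   # shuffled order -> negative
    return out
```

Learning to distinguish the two forces the network to model how appearance evolves over time, which is why pose-sensitive features emerge.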
Deep Learning of Invariant Features via Simulated Fixations in Video
This work applies salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision, and achieves state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
Deep learning from temporal coherence in video
A learning method for deep architectures that exploits sequential data, in particular the temporal coherence that naturally exists in unlabeled video recordings, and uses it to improve performance on a supervised task of interest.
Learning Correspondence From the Cycle-Consistency of Time
A self-supervised method that uses cycle-consistency in time as a free supervisory signal for learning visual representations from scratch, and demonstrates the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow.
Unsupervised Representation Learning by Predicting Image Rotations
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to their input images, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
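The rotation pretext task is simple to sketch: each image is rotated by 0, 90, 180, or 270 degrees, and the rotation index becomes the classification label. A minimal NumPy sketch of the data generation (the function name is an illustrative assumption; square images are assumed so the rotated copies stack into one array):

```python
import numpy as np

def make_rotation_batch(images):
    """Given an array of square images (N, H, H, C), return the four
    rotated copies of each image and rotation labels 0..3 (k * 90 deg)."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k))  # rotate by k * 90 deg CCW
            labels.append(k)
    return np.stack(rotated), np.array(labels)
```

A classifier trained to predict `k` from the rotated image must understand object orientation, which is the "apparently simple task" supplying the supervisory signal.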