Corpus ID: 234469977

When Does Contrastive Visual Representation Learning Work?

@article{Cole2021WhenDC,
  title={When Does Contrastive Visual Representation Learning Work?},
  author={Elijah Cole and Xuan S. Yang and Kimberly Wilber and Oisin Mac Aodha and Serge J. Belongie},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.05837}
}
Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data… 

Citations

Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
TLDR
The analysis reveals that, in isolation, single popular methods should not be treated as though they represent the field as a whole, and that future work ought to consider how to leverage the complementary nature of these methods.
Simple Control Baselines for Evaluating Transfer Learning
TLDR
This work shares an evaluation standard that aims to quantify and communicate transfer learning performance in an informative and accessible setup, and encourages using and reporting the suggested control baselines when evaluating transfer learning in order to gain a more meaningful understanding.
MemSAC: Memory Augmented Sample Consistency for Large Scale Domain Adaptation
TLDR
This work proposes MemSAC, which exploits sample-level similarity across source and target domains to achieve discriminative transfer, along with architectures that scale to a large number of categories, and proposes and theoretically justifies a novel variant of the contrastive loss that promotes local consistency among within-class cross-domain samples while enforcing separation between classes.
Visual Knowledge Tracing
TLDR
This work proposes models that jointly extract the visual features used by learners and predict the classification functions they utilize, and collects three challenging new datasets from real human learners in order to evaluate the performance of visual knowledge tracing methods.
On Label Granularity and Object Localization
TLDR
Surprisingly, it is shown that choosing the right training label granularity provides a much larger performance boost than choosing the best WSOL algorithm, and that changing the label granularity can significantly improve data efficiency.
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning
TLDR
This work studies this question through a carefully controlled comparison of two approaches in terms of their ability to learn representations that generalize to downstream classification tasks, finding that when the pre-training dataset meets certain criteria (it is sufficiently large and contains descriptive captions with low variability), image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Learning Gait Representation from Massive Unlabelled Walking Videos: A Benchmark
TLDR
This work proposes a large-scale self-supervised benchmark for gait recognition based on contrastive learning, aiming to learn general gait representations from massive unlabelled walking videos for practical applications by offering informative walking priors and diverse real-world variations.
CDNet: Contrastive Disentangled Network for Fine-Grained Image Categorization of Ocular B-Scan Ultrasound
TLDR
The proposed CDNet, which aims to tackle the fine-grained image categorization (FGIC) challenges of ocular abnormalities in ultrasound images, achieves state-of-the-art performance on the FGIC task.
Is Self-Supervised Learning More Robust Than Supervised Learning?
TLDR
This work designs and conducts a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning under downstream or pre-training data distribution changes, and attempts to explain the results through the roles of data augmentation and feature-space properties.
Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing
TLDR
This work proposes a novel SSL paradigm called Scalable Dynamic Routing (SDR), which can be trained once and deployed efficiently to different downstream tasks with task-customized pre-trained models, achieving state-of-the-art averaged accuracy over 11 downstream classification tasks and AP on the PASCAL VOC detection task.
...

References

SHOWING 1-10 OF 67 REFERENCES
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
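To make the residual learning idea concrete, here is a minimal PyTorch sketch of a basic residual block (an identity shortcut added to a two-convolution stack); the class name and layer sizes are illustrative assumptions, not the paper's reference implementation.

# Minimal residual block sketch: the block learns a residual F(x) and outputs F(x) + x.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)  # identity shortcut: output is residual plus input

# Example usage with an illustrative feature map.
y = ResidualBlock()(torch.randn(2, 64, 32, 32))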
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
TLDR
This work introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning that performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks.
A Simple Framework for Contrastive Learning of Visual Representations
TLDR
It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
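As an illustration of the contrastive objective used by this framework, below is a minimal PyTorch sketch of an NT-Xent-style loss over two augmented views; the function name, temperature value, and tensor shapes are illustrative assumptions, not the authors' code.

# Minimal NT-Xent (normalized temperature-scaled cross-entropy) loss sketch.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: projection-head outputs for two augmented views of the same batch, shape (N, D)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-normalized
    sim = z @ z.t() / temperature                         # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity from the softmax
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example usage with random embeddings standing in for projection-head outputs.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))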
Momentum Contrast for Unsupervised Visual Representation Learning
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
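A minimal PyTorch sketch of the two mechanisms mentioned in the snippet, a momentum-updated key encoder and a fixed-size queue of keys, is given below; the names momentum_update and KeyQueue and all dimensions are illustrative assumptions, not the authors' implementation.

# Sketch of MoCo's core mechanisms: EMA key-encoder update and a FIFO queue of negative keys.
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(query_encoder: nn.Module, key_encoder: nn.Module, m: float = 0.999) -> None:
    """Exponential moving average of query-encoder weights into the key encoder."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.mul_(m).add_(q_param, alpha=1.0 - m)

class KeyQueue:
    """Fixed-size FIFO dictionary of encoded keys used as negatives in the contrastive loss."""
    def __init__(self, dim: int = 128, size: int = 4096):
        self.queue = torch.randn(size, dim)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor) -> None:
        # Overwrite the oldest entries with the newest batch of keys.
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.queue.shape[0]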
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
TLDR
This work extensively studies and validates model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection and many image classification datasets, and discovers that such models are more robust, more fair, less harmful and less biased than supervised models or models trained on object-centric datasets such as ImageNet.
ResNet strikes back: An improved training procedure in timm
TLDR
This paper re-evaluates the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances, and shares competitive training settings and pre-trained models in the timm open-source library, in the hope that they will serve as better baselines for future work.
Early Convolutions Help Transformers See Better
TLDR
This work conjectures that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p, p × p convolution (p = 16 by default) applied to the input image, and suggests that injecting a small dose of convolutional inductive bias into the early stages of ViTs can be hugely beneficial.
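For reference, the patchify stem described here can be written as a single strided convolution; the PyTorch sketch below uses illustrative dimensions (224 × 224 input, 768-dimensional embeddings) and is not the paper's code.

# A ViT-style patchify stem: one stride-p, p x p convolution mapping an image to patch tokens.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
patchify = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                     kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)                      # one RGB image
tokens = patchify(x).flatten(2).transpose(1, 2)      # (1, 196, 768) patch embeddings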
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations
TLDR
The results show that an approach like MoCo works surprisingly well across (i) object- versus scene-centric, (ii) uniform versus long-tailed, and (iii) general versus domain-specific datasets, and that MoCo learns spatially structured representations when trained with a multi-crop strategy.
Divide and Contrast: Self-supervised Learning from Uncurated Data
TLDR
When pretrained on less curated datasets, DnC greatly improves the performance of self-supervised learning on downstream tasks, while remaining competitive with the current state-of-the-art on curated datasets.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.
...