• Corpus ID: 208527697

Probing the State of the Art: A Critical Look at Visual Representation Evaluation

Cinjon Resnick, Zeping Zhan, Joan Bruna
Self-supervised research improved greatly over the past half decade, with much of the growth being driven by objectives that are hard to quantitatively compare. These techniques include colorization, cyclical consistency, and noise-contrastive estimation from image patches. Consequently, the field has settled on a handful of measurements that depend on linear probes to adjudicate which approaches are the best. Our first contribution is to show that this test is insufficient and that models… 
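To make the linear-probe protocol mentioned in the abstract concrete, here is a minimal sketch of the general idea: features come from an encoder that is held fixed, and only a linear classifier is trained on top of them. Everything in it is an assumption made for illustration and is not the authors' setup: `frozen_encoder` is a hypothetical random projection with a ReLU standing in for a pretrained network, the data are synthetic Gaussian blobs, and the probe is fit with plain gradient descent.

```python
import numpy as np

def frozen_encoder(x, proj):
    """Hypothetical stand-in for a pretrained encoder: fixed projection + ReLU."""
    return np.maximum(x @ proj, 0.0)

def fit_linear_probe(feats, labels, num_classes, lr=0.1, steps=500):
    """Fit a softmax linear classifier on fixed features with gradient descent."""
    n, d = feats.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (feats.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def probe_accuracy(feats, labels, weights, bias):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = (feats @ weights + bias).argmax(axis=1)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
num_classes, in_dim, feat_dim = 3, 8, 32

# Synthetic data: three Gaussian blobs (an assumption, not a real benchmark).
means = rng.normal(scale=3.0, size=(num_classes, in_dim))
y_train = rng.integers(0, num_classes, size=600)
y_test = rng.integers(0, num_classes, size=300)
x_train = means[y_train] + rng.normal(size=(600, in_dim))
x_test = means[y_test] + rng.normal(size=(300, in_dim))

# The encoder is never updated; only the probe's weights and bias are trained.
proj = rng.normal(size=(in_dim, feat_dim)) / np.sqrt(in_dim)
f_train = frozen_encoder(x_train, proj)
f_test = frozen_encoder(x_test, proj)

# Standardize features using training statistics only.
mu, sigma = f_train.mean(axis=0), f_train.std(axis=0) + 1e-8
f_train, f_test = (f_train - mu) / sigma, (f_test - mu) / sigma

weights, bias = fit_linear_probe(f_train, y_train, num_classes)
acc = probe_accuracy(f_test, y_test, weights, bias)
print(f"linear probe test accuracy: {acc:.3f}")
```

The probe's accuracy only measures how linearly separable the frozen features are for one task, which is the kind of single-number comparison the abstract argues is insufficient on its own.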


On the Origins of the Block Structure Phenomenon in Neural Network Representations
This work investigates the origin of the block structure in relation to the data and training methods, and finds that it arises from dominant datapoints — a small group of examples that share similar image statistics (e.g. background color).
Socially Supervised Representation Learning: The Role of Subjectivity in Learning Efficient Representations
The results demonstrate how communication from subjective perspectives can lead to the acquisition of more abstract representations in multi-agent systems, opening promising perspectives for future research at the intersection of representation learning and emergent communication.
Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
This paper investigates how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models that is indicative of the underlying layers preserving and propagating the dominant principal component of their representations.
Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles
It is shown that ARCH is the only CP dataset to (ARCH-)rival its computer vision analog MS-COCO Captions, and conjecture that an encoder pre-trained on dense image captions learns transferable representations for most CP tasks.
Evaluating representations by the complexity of learning low-loss predictors
Two measures are introduced that assess the quality of a representation by the complexity of learning a predictor on top of it that achieves low loss on a task of interest: surplus description length (SDL) and ε sample complexity (εSC).
How Useful Is Self-Supervised Pretraining for Visual Tasks?
  • Alejandro Newell, Jia Deng
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This work evaluates various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks, preparing a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty.
Self-Supervised Learning for Large-Scale Unsupervised Image Clustering
This paper proposes a simple scheme for unsupervised classification based on self-supervised representations and evaluates the proposed approach with several recent self-supervised methods, showing that it achieves competitive results for ImageNet classification.
Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms
A comprehensive comparative study is performed between SSL and TL regarding which one works better under different properties of data and tasks, including domain difference between source and target tasks, the amount of pretraining data, class imbalance in source data, and usage of target data for additional pretraining.


ImageNet: A large-scale hierarchical image database
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
ActivityNet: A large-scale video benchmark for human activity understanding
This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.
Data-Efficient Image Recognition with Contrastive Predictive Coding
This work revisits and improves Contrastive Predictive Coding, an unsupervised objective for learning such representations which make the variability in natural signals more predictable, and produces features which support state-of-the-art linear classification accuracy on the ImageNet dataset.
Learning Correspondence From the Cycle-Consistency of Time
A self-supervised method is introduced that uses cycle-consistency in time as a free supervisory signal for learning visual representations from scratch, and the generalizability of the representation is demonstrated, without finetuning, across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow.
Learning deep representations by mutual information estimation and maximization
It is shown that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks and is an important step towards flexible formulations of representation learning objectives for specific end-goals.
MMAction
  • https://github.com/open-mmlab/mmaction
  • 2019
PyTorch Lightning
  • https://github.com/williamFalcon/pytorch-lightning
  • 2019
Revisiting Self-Supervised Visual Representation Learning
This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.
Scaling and Benchmarking Self-Supervised Visual Representation Learning
It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.
Self-supervised Learning for Video Correspondence Flow
A simple information bottleneck is introduced that forces the model to learn robust features for correspondence matching, and prevents it from learning trivial solutions, as well as probing the upper bound by training on additional data, further demonstrating significant improvements on video segmentation.