• Corpus ID: 249191235

On the Surrogate Gap between Contrastive and Supervised Losses

  title={On the Surrogate Gap between Contrastive and Supervised Losses},
  author={Han Bao and Yoshihiro Nagano and Kento Nozawa},
Contrastive representation learning encourages data representation to make semantically similar pairs closer than randomly drawn negative samples, which has been successful in various domains such as vision, language, and graphs. Recent theoretical studies have attempted to explain the benefit of the large negative sample size by upper-bounding the downstream classification loss with the contrastive loss. However, the previous surrogate bounds have two drawbacks: they are only legitimate for a… 

Figures and Tables from this paper



A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes and shows that learned representations can reduce (labeled) sample complexity on downstream tasks.

Investigating the Role of Negatives in Contrastive Representation Learning

Theoretically, the existence of a collision-coverage trade-off is shown suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data, and empirical results find that the results broadly agree with theory and suggest future directions to better align theory and practice.

Formal Limitations on the Measurement of Mutual Information

It is proved that any distribution-free high-confidence lower bound on mutual information estimated from N samples cannot be larger than O(ln N ).

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons, and uses a swapped prediction mechanism where it predicts the cluster assignment of a view from the representation of another view.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Learning a similarity metric discriminatively, with application to face verification

The idea is to learn a function that maps input patterns into a target space such that the L/sub 1/ norm in the target space approximates the "semantic" distance in the input space.