Corpus ID: 238583030

Towards Demystifying Representation Learning with Non-contrastive Self-supervision

@article{Wang2021TowardsDR,
  title={Towards Demystifying Representation Learning with Non-contrastive Self-supervision},
  author={Xiang Wang and Xinlei Chen and Simon Shaolei Du and Yuandong Tian},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.04947}
}
Non-contrastive methods of self-supervised learning (such as BYOL and SimSiam) learn representations by minimizing the distance between two views of the same image. These approaches have achieved remarkable performance in practice, but it is not well understood 1) why these methods do not collapse to trivial solutions and 2) how the representation is learned. Tian et al. (2021) made an initial attempt at the first question and proposed DirectPred, which sets the predictor directly. In our…
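As a concrete reading of the setup described above, the following is a minimal sketch (not the authors' code): a SimSiam/BYOL-style symmetric negative-cosine loss with a stop-gradient on the target branch, together with a DirectPred-style step that sets the linear predictor from the eigendecomposition of a running estimate of the representation correlation matrix. The eigenvalue mapping (square root plus a small offset) and the eps value are assumptions standing in for the exact recipe in the paper.

import torch
import torch.nn.functional as F

def noncontrastive_loss(p1, p2, z1, z2):
    # Symmetric negative cosine similarity between the predictor outputs (p)
    # of one view and the stop-gradient projections (z) of the other view,
    # as in BYOL/SimSiam-style objectives.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2

def set_predictor_directly(corr_ema, eps=0.1):
    # DirectPred-style step (sketch): eigendecompose a running (EMA) estimate
    # of the correlation matrix of the online representations and set the
    # linear predictor from a simple function of the eigenvalues. The square
    # root plus eps used here is an assumption, not the paper's exact mapping.
    eigvals, eigvecs = torch.linalg.eigh(corr_ema)
    eigvals = eigvals.clamp(min=0.0)
    scale = torch.sqrt(eigvals / eigvals.max().clamp(min=1e-12))
    return eigvecs @ torch.diag(scale + eps) @ eigvecs.T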

Citations

Contrasting the landscape of contrastive and non-contrastive learning
TLDR
It is shown through theoretical results and controlled experiments that, even on simple data models, non-contrastive losses have a preponderance of non-collapsed bad minima, and that the training process does not avoid these minima.
The Power of Contrast for Feature Learning: A Theoretical Analysis
TLDR
It is provably shown that contrastive learning outperforms the autoencoder, a classical unsupervised learning method, for both feature recovery and downstream tasks, and the role of labeled data in supervised contrastive learning is illustrated.
One Network Doesn't Rule Them All: Moving Beyond Handcrafted Architectures in Self-Supervised Learning
TLDR
This work establishes extensive empirical evidence that network architecture plays a significant role in SSL, and proposes to learn not only network weights but also architecture topologies in the SSL regime.
Learning distinct features helps, provably
TLDR
This work theoretically investigates how learning non-redundant, distinct features affects the performance of the network, and derives novel generalization bounds, based on Rademacher complexity, that depend on feature diversity for two-layer neural networks with least-squares loss.
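For context, bounds of this kind build on the standard Rademacher-complexity result: for a loss class taking values in $[0,1]$ over $n$ i.i.d. samples, with probability at least $1-\delta$, every $f$ in the class satisfies (a textbook statement, not the specific bound derived in that paper)

$$ L(f) \;\le\; \widehat{L}_n(f) \;+\; 2\,\mathfrak{R}_n(\ell \circ \mathcal{F}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}, $$

so controlling the Rademacher complexity $\mathfrak{R}_n$ of the (diversity-constrained) feature class directly controls the generalization gap.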

References

Showing 1-10 of 15 references
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
TLDR
This paper introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
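For concreteness, a minimal sketch of the variance term described above; the full VICReg objective also has invariance and covariance terms, and the margin gamma=1 and eps values here follow the common formulation rather than being taken from the paper.

import torch
import torch.nn.functional as F

def variance_regularizer(z, gamma=1.0, eps=1e-4):
    # Hinge loss on the per-dimension standard deviation of a batch of
    # embeddings z (shape: batch x dim). Dimensions whose std falls below the
    # margin gamma are pushed back up, which prevents all embeddings from
    # collapsing to a single point.
    std = torch.sqrt(z.var(dim=0) + eps)
    return F.relu(gamma - std).mean()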
Learning Representations by Maximizing Mutual Information Across Views
TLDR
This work develops a model which learns image representations that significantly outperform prior methods on the tasks the authors consider, and extends this model to use mixture-based representations, where segmentation behaviour emerges as a natural side-effect.
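The mutual-information objective is typically estimated with an NCE-style contrastive bound; a generic InfoNCE sketch across two views is given below. The actual model in that paper scores local and global features at multiple scales, which this simplification omits.

import torch
import torch.nn.functional as F

def infonce_across_views(z1, z2, temperature=0.1):
    # z1[i] and z2[i] are features of two views of the same image; every other
    # pairing in the batch acts as a negative. Minimizing this cross-entropy
    # maximizes an NCE-style lower bound on the mutual information between views.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)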
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
TLDR
This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
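The acceleration in question concerns the induced dynamics of the end-to-end matrix $W_e = W_N \cdots W_1$ of a depth-$N$ linear network: under gradient flow with balanced initialization, the overparameterization acts on $W_e$ as a preconditioned update of the form below (reproduced from memory of that analysis, so the exact exponents should be checked against the paper).

$$ \dot W_e(t) \;=\; -\,\eta \sum_{j=1}^{N} \left[\, W_e(t)\, W_e(t)^{\top} \right]^{\frac{N-j}{N}} \nabla L\!\left(W_e(t)\right) \left[\, W_e(t)^{\top} W_e(t) \right]^{\frac{j-1}{N}} $$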
A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
TLDR
The speed of convergence to the global optimum for gradient descent training a deep linear neural network by minimizing the $\ell_2$ loss over whitened data is analyzed; convergence at a linear rate is guaranteed when the initial loss is smaller than the loss of any rank-deficient solution.
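Concretely, with whitened inputs the population $\ell_2$ loss of a depth-$N$ linear network reduces, up to additive constants, to a Frobenius-norm regression of the end-to-end product onto a fixed target matrix $\Phi$ (the input-output cross-correlation; the paper's exact normalization may differ):

$$ L(W_1,\dots,W_N) \;=\; \tfrac{1}{2}\,\bigl\| W_N W_{N-1} \cdots W_1 - \Phi \bigr\|_F^2 . $$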
A mathematical theory of semantic development in deep neural networks
TLDR
Notably, this simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Large Batch Training of Convolutional Networks
TLDR
It is argued that the current recipe for large-batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge, and a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS) is proposed.
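A minimal sketch of the layer-wise scaling at the heart of LARS; momentum and the global learning-rate schedule are omitted, and the trust coefficient and weight decay values are illustrative defaults rather than the paper's settings.

import torch

def lars_local_lr(weight, grad, trust_coef=1e-3, weight_decay=1e-4, eps=1e-9):
    # Layer-wise Adaptive Rate Scaling: scale each layer's learning rate by the
    # ratio of the weight norm to the (weight-decayed) gradient norm, so layers
    # whose gradients are small relative to their weights are not under-trained
    # when the global learning rate is scaled up for large batches.
    w_norm = weight.norm()
    g_norm = grad.norm()
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)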
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
TLDR
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
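The plateaus and rapid transitions can be read off the closed-form dynamics for a single input-output mode of strength $s$ in a two-layer linear network with balanced weights and whitened inputs: the effective mode strength $a(t)$ obeys a logistic equation (stated here as the standard form of that result, with notation assumed),

$$ \tau\,\frac{da}{dt} \;=\; 2\,a\,(s - a), \qquad a(t) \;=\; \frac{s}{1 + \left(s/a_0 - 1\right) e^{-2 s t/\tau}}, $$

so a mode with small initial strength $a_0$ spends a long time near zero before a rapid transition to its asymptotic value $s$.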
High-Dimensional Statistics
TLDR
This book provides a self-contained introduction to the area of high-dimensional statistics, aimed at the first-year graduate level, and includes chapters focused on core methodology and theory, including tail bounds, concentration inequalities, uniform laws and empirical processes, and random matrices.
High-Dimensional Probability
TLDR
A broad range of illustrations is embedded throughout, including classical and modern results for covariance estimation, clustering, networks, semidefinite programming, coding, dimension reduction, matrix completion, machine learning, compressed sensing, and sparse regression.
Introduction to the non-asymptotic analysis of random matrices
TLDR
This is a tutorial on some basic non-asymptotic methods and concepts in random matrix theory, particularly for the problem of estimating covariance matrices in statistics and for validating probabilistic constructions of measurement matrices in compressed sensing.
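As an example of the kind of result covered: for $N$ i.i.d. sub-Gaussian samples in $\mathbb{R}^n$ with covariance $\Sigma$, the sample covariance $\hat\Sigma_N$ satisfies, with high probability,

$$ \bigl\| \hat\Sigma_N - \Sigma \bigr\| \;\le\; C \left( \sqrt{\tfrac{n}{N}} + \tfrac{n}{N} \right) \|\Sigma\|, $$

so on the order of $N \gtrsim n$ samples suffice for estimation in operator norm (a standard statement of the sub-Gaussian case, with the constant $C$ unspecified).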