Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework

@article{tao2022exploring,
  title={Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework},
  author={Chenxin Tao and Honghui Wang and Xizhou Zhu and Jiahua Dong and Shiji Song and Gao Huang and Jifeng Dai},
  journal={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Self-supervised learning has shown great potential for extracting powerful visual representations without human annotations. Various works have been proposed to approach self-supervised learning from different perspectives: (1) contrastive learning methods (e.g., MoCo, SimCLR) utilize both positive and negative samples to guide the training direction; (2) asymmetric network methods (e.g., BYOL, SimSiam) get rid of negative samples via the introduction of a predictor network and the stop-gradient… 
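The contrastive branch described above is typically trained with an InfoNCE-style objective: a softmax cross-entropy over similarities in which the positive pair is pulled together and the negatives are pushed apart. A minimal NumPy sketch for a single query (the function name and the temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: softmax cross-entropy over cosine
    similarities, with the positive key at index 0."""
    q = query / np.linalg.norm(query)
    pos = positive / np.linalg.norm(positive)
    negs = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # -log p(positive)
```

A well-aligned positive with orthogonal negatives yields a near-zero loss; a misaligned positive with a confusable negative yields a large one.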

Siamese Image Modeling for Self-Supervised Vision Representation Learning

Siamese Image Modeling is proposed, which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations, and can surpass both ID and MIM on a wide range of downstream tasks.

Improving Masked Autoencoders by Learning Where to Mask

AutoMAE is presented, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially trained mask generator with a mask-guided image modeling process. It adaptively finds patches with higher information density in different images, striking a balance between the information gain from image reconstruction and its practical training difficulty.

ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

A novel normalization layer called ContraNorm is proposed, inspired by the effectiveness of contrastive learning in preventing dimensional collapse. It implicitly shatters representations in the embedding space, leading to a more uniform distribution and milder dimensional collapse.

Similarity Contrastive Estimation for Self-Supervised Soft Contrastive Learning

This work proposes a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE), which estimates from one view of a batch a continuous distribution to push or pull instances based on their semantic similarities.

Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning

This work proposes a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE), and shows that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.

Ladder Siamese Network: a Method and Insights for Multi-level Self-Supervised Learning

This work proposes a framework to exploit intermediate self-supervisions in each stage of deep nets, called the Ladder Siamese Network, and improves image-level classification, instance-level detection, and pixel-level segmentation simultaneously.

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

An all-in-one single-stage pre-training approach, named M3I Pre-training, achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification and COCO.

FedSiam-DA: Dual-aggregated Federated Learning via Siamese Networks under Non-IID Data

FedSiam-DA, a novel dual-aggregated contrastive federated learning approach, personalizes both local and global models under various settings of data heterogeneity and outperforms several previous FL approaches on heterogeneous datasets.

Unifying Visual Contrastive Learning for Object Recognition from a Graph Perspective

This paper proposes to unify existing unsupervised visual contrastive learning methods by using a GCN layer as the predictor layer (UniVCL), which brings two merits to unsupervised learning in object recognition.

RegionCL: Exploring Contrastive Region Pairs for Self-supervised Representation Learning

Self-supervised learning (SSL) methods have achieved significant success via maximizing the mutual information between two augmented views, where cropping is a popular augmentation technique.

Unsupervised Finetuning

This paper finds that source data is crucial when shifting the finetuning paradigm from supervised to unsupervised, and proposes two simple and effective strategies to combine source and target data for unsupervised finetuning: “sparse source data replaying” and “data mixing”.

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

This paper introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
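As the abstract describes, VICReg combines three terms: an invariance (MSE) term between the two views, a hinge on the per-dimension standard deviation to prevent collapse, and an off-diagonal covariance penalty to decorrelate dimensions. A minimal NumPy sketch of that loss (the weight values shown are the commonly reported defaults, used here only for illustration):

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg loss on two (N, D) embedding batches of the same N images."""
    n, d = z_a.shape
    # invariance: mean-squared error between the two views
    inv = np.mean((z_a - z_b) ** 2)

    # variance: hinge keeping each dimension's std above 1 (anti-collapse)
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    # covariance: penalize off-diagonal covariance (decorrelation)
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (sim_w * inv
            + var_w * (var_term(z_a) + var_term(z_b))
            + cov_w * (cov_term(z_a) + cov_term(z_b)))
```

All three terms are non-negative, so the loss is bounded below by zero; identical, well-spread views score far lower than anti-aligned ones.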

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

This work proposes an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible.
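In words: normalize each embedding dimension over the batch, compute the D×D cross-correlation between the two views, and drive it toward the identity — diagonal toward 1 (invariance), off-diagonal toward 0 (redundancy reduction). A minimal NumPy sketch (the function name and the λ weight are illustrative):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Push the cross-correlation matrix of two batch-normalized
    embedding batches (N, D) toward the identity matrix."""
    n = z_a.shape[0]
    za = (z_a - z_a.mean(0)) / z_a.std(0)   # per-dimension batch norm
    zb = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (za.T @ zb) / n                     # (D, D) cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    off_diag = np.sum((c - np.diag(np.diag(c))) ** 2)
    return on_diag + lam * off_diag
```

Feeding the same batch as both views gives a near-zero loss (the diagonal is exactly 1); decoupling the views, e.g. by shuffling one batch, makes the diagonal collapse toward 0 and the loss jump.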

Exploring Simple Siamese Representation Learning

  • Xinlei Chen, Kaiming He
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
Surprising empirical results are reported that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders.
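The SimSiam objective is a symmetrized negative cosine similarity in which each predictor output is compared against the stop-gradient projection of the other view. NumPy has no autograd, so the sketch below can only note the stop-gradient in a comment; it is an illustration of the loss shape, not the training mechanics:

```python
import numpy as np

def negative_cosine(p, z):
    """D(p, z) = -cos(p, z). In SimSiam, z is under stop-gradient:
    it is treated as a constant, so only the p branch is updated."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(p @ z)

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized loss over the two views' predictor outputs (p1, p2)
    and projections (z1, z2)."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

Perfectly aligned views reach the minimum value of -1; orthogonal views score 0.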

Improved Baselines with Momentum Contrastive Learning

With simple modifications to MoCo, this note establishes stronger baselines that outperform SimCLR and do not require large training batches, and hopes this will make state-of-the-art unsupervised learning research more accessible.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Emerging Properties in Self-Supervised Vision Transformers

This paper asks whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons, and uses a swapped prediction mechanism where it predicts the cluster assignment of one view from the representation of another view.

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

This work introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning that performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks.

Momentum Contrast for Unsupervised Visual Representation Learning

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
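The queue part of that dictionary is a simple fixed-size FIFO: each training step enqueues the current batch of encoded keys and evicts the oldest batch. A minimal NumPy sketch of just the queue bookkeeping (class and parameter names are illustrative; the momentum encoder and loss are omitted):

```python
import numpy as np

class MoCoQueue:
    """Fixed-size FIFO dictionary of encoded keys. Each step overwrites
    the oldest slots with the newest batch of keys."""
    def __init__(self, dim, size, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = rng.normal(size=(size, dim))       # random init
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)
        self.ptr = 0
        self.size = size

    def enqueue(self, batch_keys):
        n = batch_keys.shape[0]
        # simplification: queue size is a multiple of the batch size,
        # so a batch never wraps around the end of the buffer
        assert self.size % n == 0
        self.keys[self.ptr:self.ptr + n] = batch_keys
        self.ptr = (self.ptr + n) % self.size
```

The pointer wraps around once the queue is full, so the dictionary always holds the most recent `size` keys regardless of batch size.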