Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency

Viraj Prabhu, Sriram Yenamandra, Aaditya Singh, Judy Hoffman
Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition – the increasingly popular Vision Transformer (ViT) – and modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL… 
1 Citation

Aerial Image Object Detection With Vision Transformer Detector (ViTDet)

The empirical study shows that ViTDet’s simple design achieves good performance on natural scene images, can be easily embedded into any detector architecture, and achieves competitive performance for oriented bounding box (OBB) object detection.



TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

This paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks and then proposes a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit that transferability. Its core is the Transferability Adaption Module (TAM), a novel and effective unit that compels ViT to focus on both transferable and discriminative features.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

Emerging Properties in Self-Supervised Vision Transformers

This paper questions whether self-supervised learning gives Vision Transformers (ViT) new properties that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Masked Feature Prediction for Self-Supervised Visual Pre-Training

This work presents Masked Feature Prediction (MaskFeat), a self-supervised pre-training approach for video models that randomly masks out a portion of the input sequence and then predicts the features of the masked regions. It finds that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, work particularly well in terms of both performance and efficiency.

Masked Siamese Networks for Label-Efficient Learning

This work proposes Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations that improves the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification.

Momentum Contrast for Unsupervised Visual Representation Learning

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a

CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation

This paper designs a two-way center-aware labeling algorithm to produce pseudo labels for samples in the target domain, and proposes a weight-sharing triple-branch transformer framework that applies self-attention and cross-attention for source/target feature learning and source-target domain alignment, respectively.

Learning Transferable Features with Deep Adaptation Networks

A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural networks to the domain adaptation scenario, can learn transferable features with statistical guarantees, and can scale linearly by using an unbiased estimate of the kernel embedding.

Unsupervised Visual Representation Learning by Context Prediction

It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.