Corpus ID: 235436185

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Furu Wei

We introduce a self-supervised vision representation model, BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT [DCLT19], developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training: image patches (such as 16 × 16 pixels) and visual tokens (i.e., discrete tokens). We first “tokenize” the original image into visual…
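The two-view setup above can be sketched in a few lines. This is a minimal illustration only: the function names (`patchify`, `random_mask`) and the uniform-random masking are placeholders, and BEiT's actual pipeline uses a learned discrete tokenizer and block-wise masking rather than this simplified version.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an HxWxC image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes first
    return patches.reshape(-1, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio=0.4, rng=None):
    """Pick a random subset of patch indices to mask; the model would be
    trained to predict the visual tokens at these positions."""
    rng = rng or np.random.default_rng(0)
    num_masked = int(num_patches * mask_ratio)
    return rng.choice(num_patches, size=num_masked, replace=False)

image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)             # (196, 768): 14x14 patches of 16*16*3 values
masked_idx = random_mask(len(patches))
```

For a 224 × 224 image this yields the familiar 14 × 14 = 196 patch grid; the prediction targets would be the discrete visual tokens, not raw pixels.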

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

DeiT III: Revenge of the ViT

This paper revisits the supervised training of ViTs and builds upon and simplifies a recipe introduced for training ResNet-50, and includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning.

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

A new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) is offered that enjoys both high efficiency and good performance in MIM, and the key is to remove the unnecessary ‘local inter-unit operations’.

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

An improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives, and demonstrates the superiority of this method on classification, segmentation, and detection tasks.

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

This work proposes to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level, and proposes vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes.

Masked Image Modeling with Denoising Contrast

This work unleashes the great potential of contrastive learning on denoising auto-encoding and introduces a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

Object-wise Masked Autoencoders for Fast Pre-training

This work shows that current masked image encoding models learn the underlying relationships among all objects in the whole scene rather than a single object representation, and introduces a novel object selection and division strategy that drops non-object patches to learn object-wise representations via selective reconstruction with region-of-interest masks.

A Closer Look at Self-supervised Lightweight Vision Transformers

This work develops recipes for pre-training high-performance lightweight ViTs with masked-image-modeling-based MAE, namely MAE-lite, and reveals that properly learned lower layers of the pre-trained models matter more than higher ones in data-sufficient downstream tasks.

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

This paper proposes an unsupervised representation learning task trained on pairs of images showing the same scene from different viewpoints: the pretext task of cross-view completion, where the first input image is partially masked and the masked content must be reconstructed from the visible content and the second image.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.

Emerging Properties in Self-Supervised Vision Transformers

This paper asks whether self-supervised learning gives Vision Transformers (ViT) new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Self-Supervised Learning with Swin Transformers

This paper presents a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, and enables evaluation of the learnt representations on downstream tasks such as object detection and semantic segmentation.

An Empirical Study of Training Self-Supervised Vision Transformers

This work investigates the effects of several fundamental components of training self-supervised ViTs, and reveals that seemingly good results can hide partial failures, which improve when training is made more stable.

Going deeper with Image Transformers

This work builds and optimizes deeper transformer networks for image classification, investigates the interplay of architecture and optimization in such dedicated transformers, and makes two architecture changes that significantly improve the accuracy of deep transformers.

Unsupervised Representation Learning by Predicting Image Rotations

This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the image they receive as input, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
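The rotation pretext task described above is simple to set up as data generation. A hedged sketch (the helper name `make_rotation_batch` is hypothetical; the paper's networks are ConvNets trained to classify these labels):

```python
import numpy as np

def make_rotation_batch(image):
    """Create the four rotated views (0, 90, 180, 270 degrees) and their
    class labels, in the style of the rotation-prediction pretext task;
    a network would be trained to predict the label from the rotated view."""
    views = [np.rot90(image, k) for k in range(4)]  # k quarter-turns
    labels = np.arange(4)                           # 0 -> 0deg, 1 -> 90deg, ...
    return views, labels

image = np.arange(16).reshape(4, 4)
views, labels = make_rotation_batch(image)
```

The supervisory signal is free: the labels come from the transformation itself, so no human annotation is needed.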

Generative Pretraining From Pixels

This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
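The "predict pixels without 2D structure" setup can be sketched as plain sequence construction. This is an illustrative simplification (the function name is hypothetical, and the actual work also downsamples and color-quantizes images before modeling them):

```python
import numpy as np

def pixel_sequence_targets(image):
    """Flatten an image into a 1D pixel sequence in raster order, discarding
    the 2D structure, and pair each position with its next-pixel target for
    autoregressive prediction."""
    seq = image.reshape(-1)
    inputs, targets = seq[:-1], seq[1:]  # predict pixel t+1 from pixels <= t
    return inputs, targets

image = np.arange(9, dtype=np.int64).reshape(3, 3)
inputs, targets = pixel_sequence_targets(image)
```

A Transformer trained on such sequences never sees row/column coordinates; any spatial regularity must be learned from the raster ordering alone.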

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Scaling Vision Transformers

A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well for few-shot transfer.

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

This paper deploys a pure transformer to encode an image as a sequence of patches, termed SEgmentation TRansformer (SETR), and shows that SETR achieves a new state of the art on ADE20K and Pascal Context, and competitive results on Cityscapes.