Bootstrapped Masked Autoencoders for Vision BERT Pretraining

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoder (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features…
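The momentum encoder in the first design is, in essence, an exponential moving average (EMA) of the online encoder's weights, so the feature targets improve as training progresses. A minimal numpy sketch of such an update (the function name and momentum value are illustrative, not taken from the paper):

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.999):
    """Move each target (momentum-encoder) parameter toward the
    corresponding online-encoder parameter via an exponential
    moving average; the target network gets no gradients."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

# Toy example: one weight matrix per "network".
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = ema_update(target, online, momentum=0.9)
# Each target entry moves 10% of the way toward the online weight.
```

A high momentum keeps the target features slowly varying, which is what makes them usable as a stable prediction target while the online encoder is still changing.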

Masked Autoencoders Are Scalable Vision Learners

This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
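The asymmetry can be pictured as a masking step that hands only a small visible subset of patches to the encoder and passes the positions of the masked patches to the decoder. A minimal numpy sketch (the 75% mask ratio is the commonly reported MAE default, used here for illustration):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Split patches into a visible subset (fed to the encoder) and
    the indices of masked patches (reconstructed by the decoder).

    patches: (num_patches, patch_dim) array of flattened patches.
    """
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]
    num_visible = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    visible_idx = np.sort(perm[:num_visible])
    masked_idx = np.sort(perm[num_visible:])
    return patches[visible_idx], visible_idx, masked_idx

# 196 patches of a 14x14 grid, each flattened to 768 values.
patches = np.zeros((196, 768))
visible, vis_idx, mask_idx = random_masking(patches)
# The encoder sees only 49 of 196 patches.
```

Because the heavy encoder never touches mask tokens, most of the compute is spent on a quarter of the sequence, which is what makes the scheme scalable.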

Context Autoencoder for Self-Supervised Representation Learning

A novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining. It introduces an alignment constraint encouraging that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked-patch representations computed by the encoder.

Masked Feature Prediction for Self-Supervised Visual Pre-Training

This work presents Masked Feature Prediction (MaskFeat), a self-supervised pre-training method for video models that randomly masks out a portion of the input sequence and then predicts the features of the masked regions. It finds that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, work particularly well in terms of both performance and efficiency.
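As a rough illustration of the HOG target, gradient orientations within a patch are binned into a histogram weighted by gradient magnitude. The sketch below is a deliberately simplified, single-cell, unnormalized version (the full descriptor uses cells, blocks, and block normalization):

```python
import numpy as np

def hog_histogram(patch, num_bins=9):
    """Compute one unsigned orientation histogram (0-180 degrees)
    over a grayscale patch, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180).
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((orientation / (180.0 / num_bins)).astype(int),
                      num_bins - 1)
    hist = np.zeros(num_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

# A horizontal intensity ramp: all gradient energy is at 0 degrees.
patch = np.tile(np.arange(8.0), (8, 1))
hist = hog_histogram(patch)
```

Because HOG summarizes local edge structure while discarding exact pixel values, it makes a cheaper and more robust regression target than raw pixels.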

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

iBOT: Image BERT Pre-Training with Online Tokenizer

A self-supervised framework, iBOT, that performs masked prediction with an online tokenizer and highlights emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

Generative Pretraining From Pixels

This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.

Big Transfer (BiT): General Visual Representation Learning

By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
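The contrastive loss applied after that nonlinear projection head is the normalized temperature-scaled cross-entropy (NT-Xent): each example's two augmented views form a positive pair, and all other samples in the batch serve as negatives. A minimal numpy sketch (the temperature of 0.5 is a commonly used default, not prescriptive):

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of positive pairs (z1[i], z2[i]);
    every other sample in the concatenated batch is a negative."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    # The positive partner of sample i sits at i + n/2 (mod n).
    pos = (np.arange(n) + n // 2) % n
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 16))  # projections of view 1
z2 = rng.normal(size=(4, 16))  # projections of view 2
loss = nt_xent(z1, z2)
```

The summary's point about batch size follows directly from this formulation: a larger batch supplies more negatives in the denominator, making the task harder and the learned representations stronger.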