Bootstrapped Masked Autoencoders for Vision BERT Pretraining

@article{Dong2022BootstrappedMA,
  title={Bootstrapped Masked Autoencoders for Vision BERT Pretraining},
  author={Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and Nenghai Yu},
  journal={ArXiv},
  year={2022},
  volume={abs/2207.07116}
}
We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information during BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features…
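
The momentum-encoder idea can be illustrated with a short sketch. The snippet below is not the authors' implementation; the names (MomentumTarget, feature_prediction_loss), the smooth-L1 loss, and the momentum value are assumptions for illustration. It shows the essential mechanism: an exponential-moving-average (EMA) copy of the online encoder produces features of the full image, and those features act as extra regression targets at the masked positions.

# Minimal sketch (not the authors' code): an EMA "momentum encoder" whose
# full-image features serve as extra prediction targets for masked patches.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumTarget(nn.Module):
    def __init__(self, encoder: nn.Module, momentum: float = 0.999):
        super().__init__()
        self.momentum = momentum
        # Frozen EMA copy of the online encoder; updated without gradients.
        self.ema_encoder = copy.deepcopy(encoder)
        for p in self.ema_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def update(self, encoder: nn.Module):
        # theta_ema <- m * theta_ema + (1 - m) * theta_online
        for p_ema, p in zip(self.ema_encoder.parameters(), encoder.parameters()):
            p_ema.mul_(self.momentum).add_(p, alpha=1.0 - self.momentum)

    @torch.no_grad()
    def forward(self, full_image_tokens: torch.Tensor) -> torch.Tensor:
        # Features of the *full* image become regression targets for masked tokens.
        return self.ema_encoder(full_image_tokens)

def feature_prediction_loss(pred_features, target_features, mask):
    # mask: (B, N) 0/1 float tensor marking masked positions; loss on masked tokens only.
    per_token = F.smooth_l1_loss(pred_features, target_features, reduction="none").mean(-1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)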

References

Showing 1-10 of 61 references

Masked Autoencoders Are Scalable Vision Learners

TLDR
This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
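
As a concrete illustration of that asymmetry, here is a minimal PyTorch sketch; TinyMAE and all layer sizes are made up for brevity and are not the released MAE code. The structural point is that the encoder only ever sees the visible tokens, while the lightweight decoder receives a full-length sequence in which masked positions are filled with a shared learned mask token and then predicts pixels per patch.

# Minimal sketch (hypothetical modules and sizes) of the asymmetric MAE design.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, dim=256, dec_dim=128, patch_pixels=16 * 16 * 3, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Linear(patch_pixels, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        # Lightweight decoder working in a narrower dimension.
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patches, visible_idx):
        # patches: (B, N, patch_pixels) flattened patches; visible_idx: (B, N_vis) kept positions.
        B, N, _ = patches.shape
        tokens = self.patch_embed(patches) + self.pos
        vis = torch.gather(tokens, 1, visible_idx[..., None].expand(-1, -1, tokens.size(-1)))
        enc = self.encoder(vis)                      # encoder runs on visible patches only
        dec_vis = self.enc_to_dec(enc)
        # Start from mask tokens everywhere, then place the encoded visible tokens back.
        dec_tokens = self.mask_token.expand(B, N, -1).clone()
        dec_tokens.scatter_(1, visible_idx[..., None].expand(-1, -1, dec_tokens.size(-1)), dec_vis)
        dec = self.decoder(dec_tokens + self.dec_pos)
        return self.to_pixels(dec)                   # per-patch pixel reconstruction

The reconstruction loss would then typically be computed only on the masked patches.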

Context Autoencoder for Self-Supervised Representation Learning

TLDR
A novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining; it introduces an alignment constraint encouraging that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder.
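
One way to read that constraint, with hypothetical tensor names and an assumed MSE formulation, is the short sketch below: predictions for masked patches, made from the visible-patch representations, are pulled toward the encoder's own (detached) representations of those masked patches.

# Small sketch (assumed tensors, not the CAE code) of an alignment-style constraint.
import torch
import torch.nn.functional as F

def alignment_loss(pred_masked: torch.Tensor, enc_masked: torch.Tensor) -> torch.Tensor:
    # pred_masked: (B, N_mask, D) predicted by a latent regressor from visible patches.
    # enc_masked:  (B, N_mask, D) encoder outputs on the masked patches (treated as targets).
    return F.mse_loss(pred_masked, enc_masked.detach())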

Masked Feature Prediction for Self-Supervised Visual Pre-Training

TLDR
This work presents Masked Feature Prediction (MaskFeat), a self-supervised pre-training approach for video models that randomly masks out a portion of the input sequence and then predicts the features of the masked regions, and finds that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
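
The target side of that recipe can be sketched as follows. This is a simplified histogram-of-oriented-gradients computation, not the paper's exact HOG pipeline (no block normalization or cell interleaving); the function name, patch size, and bin count are assumptions. It produces one orientation histogram per patch, which a model would regress for the masked patches.

# Rough sketch of simplified per-patch HOG targets (assumed hyperparameters).
import torch
import torch.nn.functional as F

def hog_targets(gray, patch=16, bins=9):
    # gray: (B, 1, H, W) grayscale images in [0, 1].
    gx = F.pad(gray[..., :, 1:] - gray[..., :, :-1], (0, 1))        # horizontal gradient
    gy = F.pad(gray[..., 1:, :] - gray[..., :-1, :], (0, 0, 0, 1))  # vertical gradient
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    ang = torch.atan2(gy, gx) % torch.pi                            # orientation in [0, pi)
    bin_idx = (ang / torch.pi * bins).long().clamp(max=bins - 1)
    B, _, H, W = gray.shape
    hist = gray.new_zeros(B, bins, H, W)
    hist.scatter_(1, bin_idx, mag)                                  # vote magnitude into its bin
    # Pool votes within each patch -> one histogram per patch, then flatten and normalize.
    hist = F.avg_pool2d(hist, kernel_size=patch, stride=patch)
    return F.normalize(hist.flatten(2).transpose(1, 2), dim=-1)     # (B, num_patches, bins)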

Context Encoders: Feature Learning by Inpainting

TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

iBOT: Image BERT Pre-Training with Online Tokenizer

TLDR
A self-supervised framework, iBOT, that performs masked prediction with an online tokenizer and highlights emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

TLDR
It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

TLDR
CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

Generative Pretraining From Pixels

TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
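
A toy version of that setup, with made-up sizes and a hypothetical module name, looks like the following: pixels are quantized to a small palette, flattened in raster order, and a causally masked Transformer predicts each next token with cross-entropy.

# Toy sketch (not the iGPT code) of autoregressive pixel-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelGPT(nn.Module):
    def __init__(self, vocab=512, dim=256, seq_len=32 * 32, layers=4):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (B, L) integer pixel codes; the causal mask enforces left-to-right prediction.
        L = tokens.size(1)
        x = self.tok(tokens) + self.pos[:, :L]
        causal = torch.full((L, L), float("-inf"), device=tokens.device).triu(1)
        x = self.blocks(x, mask=causal)
        return self.head(x)              # logits for the next pixel code at every position

# Training step: predict position t+1 from positions <= t.
# logits = model(codes[:, :-1])
# loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), codes[:, 1:].reshape(-1))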

Big Transfer (BiT): General Visual Representation Learning

TLDR
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

A Simple Framework for Contrastive Learning of Visual Representations

TLDR
It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
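
For concreteness, a compact sketch of the two ingredients named above, the nonlinear projection head and the temperature-scaled contrastive (NT-Xent) loss, is given below; the dimensions and the temperature value are assumptions rather than the paper's settings.

# Minimal sketch of a projection head and the NT-Xent contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    # Learnable nonlinear transformation between the representation and the contrastive loss.
    def __init__(self, dim=2048, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))

    def forward(self, h):
        return F.normalize(self.net(h), dim=-1)

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: (B, D) normalized projections of two augmented views of the same batch.
    B = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                         # (2B, D)
    sim = z @ z.t() / temperature                          # cosine similarities (z is normalized)
    eye = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))              # exclude self-pairs
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # positive pair = the other view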
...