Bootstrapped Masked Autoencoders for Vision BERT Pretraining

Authors: Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu
Venue: European Conference on Computer Vision

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features…
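The momentum encoder in the first design is typically maintained as an exponential moving average (EMA) of the online encoder's weights, a common pattern in bootstrapped self-supervised methods. A minimal sketch of such an update follows; the function and parameter names are illustrative, not taken from the paper, and the momentum value is a typical choice rather than BootMAE's exact setting.

```python
import numpy as np

def ema_update(online_params, momentum_params, m=0.999):
    """Move each momentum-encoder parameter toward its online counterpart.

    m close to 1.0 means the momentum encoder changes slowly, giving
    stable feature targets for the BERT-style prediction loss.
    """
    return [m * p_m + (1.0 - m) * p_o
            for p_o, p_m in zip(online_params, momentum_params)]

# toy example: one weight matrix, exaggerated momentum for visibility
online = [np.ones((2, 2))]
momentum = [np.zeros((2, 2))]
momentum = ema_update(online, momentum, m=0.9)
```

After one step with m=0.9, each momentum weight has moved one tenth of the way toward the online weight; gradients never flow through the momentum branch.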

Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders

It is shown that RC-MAE is more robust and performs better than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation, which may offer a way to make the prohibitively expensive self-supervised learning of Vision Transformer models more practical.

CAE v2: Context Autoencoder with CLIP Target

This work observes that supervision on visible patches achieves remarkable performance, even better than supervision on masked patches (the standard format in existing MIM methods), and finds that the optimal mask ratio correlates positively with model size.

Towards Sustainable Self-supervised Learning

It is shown that all SSL pretrained models can serve as good base models with the help of the target enhancement in TEC, and that the adapters and target-enhancing scheme in TEC enable good adaptability to various base-model targets.

A Unified View of Masked Image Modeling

Under the unified view, a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images, achieves performance comparable or superior to state-of-the-art methods.

Spatio-Temporal Crop Aggregation for Video Representation Learning

This work proposes Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time, and demonstrates that the model yields state-of-the-art performance with linear, non-linear, and k-NN probing on common action classification datasets.

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for…

Self-Supervised Learning based on Heat Equation

A new perspective of self-supervised learning based on extending heat equation into high dimensional feature space is presented and an insightful hypothesis on the invariance within visual representation over different shapes and textures: the linear relationship between horizontal and vertical derivatives is provided.

Context Autoencoder for Self-Supervised Representation Learning

A novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining, which introduces an alignment constraint encouraging that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

iBOT: Image BERT Pre-Training with Online Tokenizer

A self-supervised framework iBOT that can perform masked prediction with an online tokenizer and underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.

Masked Feature Prediction for Self-Supervised Visual Pre-Training

This work presents Masked Feature Prediction (MaskFeat), a self-supervised pre-training of video models that randomly masks out a portion of the input sequence and then predicts the feature of the masked regions, and finds Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
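The HOG targets MaskFeat regresses can be approximated in a few lines. The sketch below is a simplified illustration, not MaskFeat's implementation: the bin count is a typical choice, and the block normalization used in full HOG is omitted.

```python
import numpy as np

def hog_histogram(patch, n_bins=9):
    """Unnormalized histogram of oriented gradients for one image patch.

    Gradients come from np.gradient; unsigned orientations in [0, pi)
    are binned, with votes weighted by gradient magnitude.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())             # magnitude-weighted votes
    return hist

# a vertical step edge puts all gradient energy in one orientation bin
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
h = hog_histogram(patch)
```

Regressing such histograms at masked positions gives the model a local-structure target that is cheap to compute and, as the summary above notes, surprisingly competitive with learned features.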

BEiT: BERT Pre-Training of Image Transformers

A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced, and it is demonstrated that it can learn reasonable semantic regions via pre-training, unleashing the rich supervision signals contained in images.

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

Generative Pretraining From Pixels

This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.

Big Transfer (BiT): General Visual Representation Learning

By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
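The contrastive loss referred to above is SimCLR's NT-Xent (normalized temperature-scaled cross-entropy). A compact numpy sketch under simplifying assumptions (no projection head shown, temperature value illustrative):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for two batches of embeddings of the same N images.

    z1, z2: (N, D) arrays, one row per augmented view. Each example's
    positive is its counterpart in the other view; the remaining 2N - 2
    embeddings in the batch serve as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive indices
    logits = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# two views that nearly agree should yield a small, finite loss
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.01 * rng.normal(size=(4, 8))
loss = nt_xent(z1, z2)
```

The batch-size sensitivity noted in the summary is visible here: the denominator sums over all 2N - 1 other embeddings, so larger batches supply more negatives per positive.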