A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In-So Kweon
As the title of MAE [1] puts it, masked autoencoders are scalable vision learners, which suggests that self-supervised learning (SSL) in vision might follow a trajectory similar to that in NLP. Specifically, generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP. By contrast, early attempts at generative methods in vision were overshadowed by their discriminative counterparts (such as contrastive learning); however, the success of mask image…


AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

AdaMAE, an end-to-end trainable adaptive masking strategy for MAEs, is proposed. It samples more tokens from regions with high spatiotemporal information, which allows 95% of tokens to be masked and results in lower memory requirements and faster pre-training.
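The idea of keeping more visible tokens where the information is concentrated can be illustrated with a small sketch. This is not AdaMAE's trainable sampling network: the per-token `scores` are hypothetical stand-ins for whatever such a network would output, the function name `sample_visible_tokens` is invented for illustration, and the 1568-token input is an assumed example size. Only the 95% mask ratio comes from the summary above.

```python
import random

def sample_visible_tokens(scores, mask_ratio=0.95, seed=None):
    """Pick visible token indices without replacement, favoring high scores.

    `scores` are assumed positive weights (hypothetical, not AdaMAE's network).
    """
    rng = random.Random(seed)
    n_visible = max(1, round(len(scores) * (1 - mask_ratio)))
    # Efraimidis-Spirakis weighted sampling: draw key = u ** (1 / weight)
    # for each token and keep the tokens with the largest keys.
    keys = [(rng.random() ** (1.0 / s), i) for i, s in enumerate(scores)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:n_visible])

# Assumed toy input: 1568 tokens, the last 68 marked as "high information".
scores = [1.0] * 1500 + [50.0] * 68
visible = sample_visible_tokens(scores, mask_ratio=0.95, seed=0)
print(len(visible), "of", len(scores), "tokens stay visible")
```

With a 95% mask ratio only about one token in twenty reaches the encoder, which is where the memory and speed savings would come from.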

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

An all-in-one single-stage pre-training approach, named M3I Pre-training, is proposed; it achieves better performance than previous pre-training methods on various vision benchmarks, such as ImageNet classification and COCO.

RGMIM: Region-Guided Masked Image Modeling for COVID-19 Detection

This paper designs a new masking strategy that uses lung mask information to locate valid regions, so that the model learns more helpful information for COVID-19 detection. Experimental results show that RGMIM can outperform other state-of-the-art self-supervised learning methods on an open COVID-19 radiography dataset.



How to Understand Masked Autoencoders

This paper proposes a unified theoretical framework that provides a mathematical understanding for MAE and explains the patch-based attention approaches of MAE using an integral kernel under a non-overlapping domain decomposition setting.

MST: Masked Self-Supervised Transformer for Visual Representation

A novel Masked Self-supervised Transformer approach, named MST, is proposed; it explicitly captures the local context of an image while preserving the global semantic information, and outperforms supervised methods trained for the same number of epochs by 0.4%.

An Empirical Study of Training Self-Supervised Vision Transformers

This work investigates the effects of several fundamental components for training self-supervised ViT, and reveals that apparently good results can in fact be partial failures, which can be improved when training is made more stable.

Siamese Image Modeling for Self-Supervised Vision Representation Learning

Siamese Image Modeling is proposed, which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations, and can surpass both instance discrimination (ID) and masked image modeling (MIM) on a wide range of downstream tasks.

Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

This work proposes local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7 × 7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image.
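To make the contrast with global masking concrete, here is a minimal sketch of selecting a 7 × 7 patch window and masking patches only inside it. The 14 × 14 patch grid, the 0.8 mask ratio, and the function name `sample_local_window_mask` are assumptions chosen for illustration; only the 7 × 7 window size comes from the summary above, and this is not the authors' implementation.

```python
import random

def sample_local_window_mask(grid=14, window=7, mask_ratio=0.8, seed=None):
    """Pick a random window on a patch grid and split its patches.

    Returns the window's top-left corner plus the visible and masked
    (row, col) patch coordinates. All default values are assumed.
    """
    rng = random.Random(seed)
    # Top-left corner chosen so the window fits inside the grid.
    top = rng.randint(0, grid - window)
    left = rng.randint(0, grid - window)
    cells = [(top + r, left + c) for r in range(window) for c in range(window)]
    rng.shuffle(cells)
    n_masked = round(mask_ratio * len(cells))
    masked = sorted(cells[:n_masked])
    visible = sorted(cells[n_masked:])
    return (top, left), visible, masked

origin, visible, masked = sample_local_window_mask(seed=0)
print("window origin:", origin)
print(len(visible), "visible and", len(masked), "masked patches")
```

Only the 49 patches inside the window take part in reconstruction, instead of all 196 patches of the assumed 14 × 14 grid.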

Masked Image Modeling with Denoising Contrast

This work unleashes the great potential of contrastive learning on denoising auto-encoding and introduces a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.

Emerging Properties in Self-Supervised Vision Transformers

This paper asks whether self-supervised learning gives Vision Transformers (ViT) new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Architecture-Agnostic Masked Image Modeling - From ViT back to CNN

It is observed that MIM essentially teaches the model to learn better middle-level interactions among patches and extract more generalized features, and an Architecture-Agnostic Masked Image Modeling framework is proposed, which is compatible with not only Transformers but also CNNs in a unified way.

Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly optimizes both the pre-training efficiency and…

Multimodal Masked Autoencoders Learn Transferable Representations

This paper proposes a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction, and finds that M3AE is able to learn generalizable representations that transfer well to downstream tasks.