mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-yu Duan
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT [1], casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE [2]. Despite being a feasible solution, the improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we believe…

Point-McBert: A Multi-choice Self-supervised Framework for Point Cloud Pre-training

Point-McBert, a pre-training framework with eased and refined supervision signals, is proposed; it improves the performance of Point-BERT on all downstream tasks while incurring almost no extra computational overhead.

Boosting Point-BERT by Multi-choice Tokens

This work proposes McP-BERT, a pre-training framework with multi-choice tokens that not only improves the performance of Point-BERT on all downstream tasks, but also incurs almost no extra computational overhead, and utilizes the high-level semantics learned by the transformer to further refine the supervision signals.

Masked Image Modeling with Denoising Contrast

This work unleashes the great potential of contrastive learning for denoising auto-encoding and introduces a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.

VLMAE: Vision-Language Masked Autoencoder

A vision-language masked autoencoder framework (VLMAE), which employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features in image and language modeling.

MILAN: Masked Image Pretraining on Language Assisted Representation

This work proposes masked image pretraining on language assisted representation, dubbed MILAN, along with a more efficient prompting decoder architecture and a semantic-aware mask sampling mechanism, which further advance the transfer performance of the pretrained model.

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

This work conducts a comprehensive survey of masked autoencoders to shed insight on a promising direction of SSL, and focuses on its application in vision by discussing its historical developments, recent progress, and implications for diverse applications.

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

This paper explores a potential visual analogue of words, i.e., semantic parts, and integrates semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy, showing that SemMAE can learn better image representations by integrating semantic information.

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

iBOT: Image BERT Pre-Training with Online Tokenizer

A self-supervised framework iBOT that can perform masked prediction with an online tokenizer and underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.

Zero-Shot Text-to-Image Generation

This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Masked Autoencoders Are Scalable Vision Learners

This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.

An Empirical Study of Training Self-Supervised Vision Transformers

This work investigates the effects of several fundamental components for training self-supervised ViT, revealing that apparently good results can in fact be partial failures, and that they can be improved when training is made more stable.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Momentum Contrast for Unsupervised Visual Representation Learning

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-average encoder.