mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-yu Duan
European Conference on Computer Vision

Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT [1], casts MIM as a classification task with a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE [2]. Although this is a feasible solution, the improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we believe…
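
To make the classification view in the abstract concrete, here is a minimal sketch of a BEiT-style objective: each masked patch is scored against every entry of a visual vocabulary, and the loss is the cross-entropy against the single token id assigned by a tokenizer. The function name `masked_token_loss`, the toy vocabulary size, and the hand-written logits are assumptions made for illustration; this is not the authors' implementation, and it shows only the single-label baseline, not the multi-choice variant the paper proposes.

```python
import math

def masked_token_loss(logits, token_ids, mask):
    """Mean cross-entropy over masked patches only (hypothetical helper).

    logits:    per-patch list of scores, one score per vocabulary entry
    token_ids: per-patch discrete token id from an assumed tokenizer
    mask:      per-patch bool, True where the patch was masked out
    """
    losses = []
    for scores, target, is_masked in zip(logits, token_ids, mask):
        if not is_masked:
            continue  # visible patches contribute no loss
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the tokenizer's chosen token.
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses) if losses else 0.0

# Toy example: 3 patches, vocabulary of 4 tokens, patches 0 and 2 masked.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
token_ids = [0, 1, 3]
mask = [True, False, True]
loss = masked_token_loss(logits, token_ids, mask)
```

Because every masked patch has exactly one target id here, a patch whose appearance is equally well described by several vocabulary entries is still penalized for predicting any of the others, which is the rigidity the abstract attributes to discretization without ground-truth answers.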


Point-McBert: A Multi-choice Self-supervised Framework for Point Cloud Pre-training

Point-McBert, a pre-training framework with eased and refined supervision signals, is proposed; it improves the performance of Point-BERT on all downstream tasks and incurs almost no extra computational overhead.

Boosting Point-BERT by Multi-choice Tokens

This work proposes McP-BERT, a pre-training framework with multi-choice tokens that improves the performance of Point-BERT on all downstream tasks, incurs almost no extra computational overhead, and utilizes the high-level semantics learned by the transformer to further refine the supervision signals.

Masked Image Modeling with Denoising Contrast

This work unleashes the great potential of contrastive learning on denoising auto-encoding and introduces a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.

MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

A Masked Image Consistency (MIC) module enhances UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition, and improves state-of-the-art performance across different recognition tasks.

CAE v2: Context Autoencoder with CLIP Target

This work observes that supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, which is the standard format in existing MIM methods, and finds that the optimal mask ratio correlates positively with model size.

VLMAE: Vision-Language Masked Autoencoder

A vision-language masked autoencoder framework (VLMAE) that employs visual generative learning, helping the model acquire fine-grained and unbiased features in image and language modeling.

MILAN: Masked Image Pretraining on Language Assisted Representation

This work proposes masked image pretraining on language assisted representation, dubbed MILAN, and introduces a more efficient prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pretrained model.

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

This work conducts a comprehensive survey of masked autoencoders to shed insight on a promising direction of SSL, and focuses on its application in vision by discussing its historical developments, recent progress, and implications for diverse applications.

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

This paper explores a potential visual analogue of words, i.e., semantic parts, and integrates semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy, and shows that SemMAE can learn better image representation by integrating semantic information.



Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene

Image BERT Pre-training with Online Tokenizer

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

It is demonstrated that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks.

Zero-Shot Text-to-Image Generation

This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Context Autoencoder for Self-Supervised Representation Learning

This work presents a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised representation pretraining, and introduces an alignment constraint that encourages the representations for masked patches, predicted from the encoded representations of visible patches, to be aligned with the masked patch representations computed from the encoder.

Masked Autoencoders Are Scalable Vision Learners

This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.

An Empirical Study of Training Self-Supervised Vision Transformers

This work investigates the effects of several fundamental components for training self-supervised ViT, and reveals that apparently good results can in fact be partial failures, which improve when training is made more stable.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Improved Baselines with Momentum Contrastive Learning

With simple modifications to MoCo, this note establishes stronger baselines that outperform SimCLR and do not require large training batches, and hopes this will make state-of-the-art unsupervised learning research more accessible.