MILAN: Masked Image Pretraining on Language Assisted Representation

Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, S. Y. Kung
Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model qualities heavily depend on the excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction based masked autoencoders are gaining popularity, which learn high quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language… 

Stare at What You See: Masked Image Modeling without Reconstruction

The experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions, and that an efficient MIM paradigm named MaskAlign can achieve state-of-the-art performance with much higher efficiency.

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the train-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

This work proposes AdaMAE, an end-to-end trainable adaptive masking strategy for MAEs, and shows that it samples more tokens from regions of high spatiotemporal information, which allows 95% of tokens to be masked and results in lower memory requirements and faster pre-training.

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

An all-in-one single-stage pre-training approach, named M3I Pre-training, which achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification and COCO.

Scaling Language-Image Pre-training via Masking

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during
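The random masking step mentioned in the summary above can be illustrated with a minimal sketch. It is not FLIP's implementation: the function name `random_patch_mask`, the 50% mask ratio, and the 14 x 14 patch grid are assumptions chosen for illustration. The sketch only selects which patch indices are kept and which are removed; an encoder would then process the kept patches alone.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return (kept, masked) patch indices, each sorted ascending.

    Hypothetical helper: picks int(num_patches * mask_ratio) patches
    uniformly at random to remove, and keeps the rest.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    kept = [i for i in range(num_patches) if i not in masked]
    return kept, sorted(masked)


# Assumed setup: a 224x224 image split into 16x16 patches gives 14 * 14 = 196 patches.
kept, masked = random_patch_mask(num_patches=196, mask_ratio=0.5, seed=0)
print(len(kept), len(masked))  # 98 98
```

Because the masked patches are dropped rather than replaced by placeholder tokens, the encoder's input sequence shrinks in proportion to the mask ratio, which is where the efficiency gain in this family of methods comes from.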



Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

By carefully utilizing the widespread supervision among the image-text pairs, the DeCLIP can learn generic visual features more efficiently and exploit data potential through the use of self-supervision within each modality; multi-view supervision across modalities; and nearest-neighbor supervision from other similar pairs.

MST: Masked Self-Supervised Transformer for Visual Representation

A novel Masked Self-supervised Transformer approach named MST, which explicitly captures the local context of an image while preserving the global semantic information, and which outperforms supervised methods trained for the same number of epochs by 0.4%.

Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

This work proposes a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs and exceeds the previous SoTA of general zero-shot learning on ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and DeCLUTR text encoder.

SLIP: Self-supervision meets Language-Image Pre-training

This work explores whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers and introduces SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training.

Learning Visual Representations with Caption Annotations

This work argues that captioned images are easily crawlable and can be exploited to supervise the training of visual representations. The proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.

Generative Pretraining From Pixels

This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

This study shows that denoising autoencoders, such as BEiT or a variant that is introduced in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

DenseCL is presented, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images and outperforms the state-of-the-art methods by a large margin.
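A pairwise contrastive loss applied per location, as the summary above describes, can be sketched with a generic InfoNCE-style objective. This is an illustration under stated assumptions, not DenseCL's exact formulation: the function names, the temperature of 0.2, and the way positives and negatives are supplied are all hypothetical, and features are plain Python lists rather than tensors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def dense_loss(queries, positives, negatives, temperature=0.2):
    """Average the per-location loss over all (query, positive) pairs.

    Assumption: queries[i] and positives[i] are features of the same
    location seen in two views; negatives are shared across locations.
    """
    losses = [info_nce(q, p, negatives, temperature)
              for q, p in zip(queries, positives)]
    return sum(losses) / len(losses)


# Toy 2-D features for two locations, with one shared negative.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, -1.0]]
print(dense_loss(queries, positives, negatives))
```

The loss is small when each query is closer to its own positive than to any negative, and grows as that ordering breaks down.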

MVP: Multimodality-guided Visual Pre-training

The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which the tokenizer is replaced with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs, and its effectiveness is demonstrated.