Corpus ID: 246634193

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

@article{Fang2022CorruptedIM,
  title={Corrupted Image Modeling for Self-Supervised Visual Pre-Training},
  author={Yuxin Fang and Li Dong and Hangbo Bao and Xinggang Wang and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.03382}
}
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens, where some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns to either recover all the original image pixels, or predict whether each visual token is replaced by a… 
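A minimal sketch of the corruption step described in the abstract, under stated assumptions: generator_logits stands in for the per-patch output distribution of the small trainable BEiT generator, and decode_patches is a hypothetical helper that maps sampled visual tokens back to pixel patches; neither name comes from the paper, and the enhancer itself is omitted.

import torch

def corrupt_image(patches, generator_logits, decode_patches, corrupt_ratio=0.5):
    """patches: (N, P, D) pixel patches; generator_logits: (N, P, V) per-patch
    distribution over visual tokens from the small generator (hypothetical)."""
    N, P, _ = patches.shape
    num_corrupt = int(P * corrupt_ratio)
    # Randomly pick which patches to replace in each image.
    idx = torch.rand(N, P, device=patches.device).argsort(dim=1)[:, :num_corrupt]
    replaced = torch.zeros(N, P, dtype=torch.bool, device=patches.device)
    replaced.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    # Sample plausible alternatives from the generator's output distribution.
    probs = generator_logits.softmax(dim=-1)
    sampled_tokens = torch.multinomial(probs.flatten(0, 1), 1).view(N, P)
    fake_patches = decode_patches(sampled_tokens)            # (N, P, D), assumed helper
    corrupted = torch.where(replaced.unsqueeze(-1), fake_patches, patches)
    # `patches` is the pixel-recovery target; `replaced` is the target for the
    # replaced-patch detection objective.
    return corrupted, replaced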
Masked Image Modeling with Denoising Contrast
TLDR
This work unleashes the great potential of contrastive learning on denoising auto-encoding and introduces a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.
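A rough sketch of what an intra-image, inter-patch contrastive constraint can look like, written as a standard InfoNCE over patch features; this illustrates the general idea rather than ConMIM's exact objective.

import torch
import torch.nn.functional as F

def patch_contrastive_loss(pred, target, temperature=0.1):
    """pred, target: (P, D) patch features of the same image; the prediction at
    position i should match target i and repel every other patch's feature."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature                   # (P, P) similarities
    labels = torch.arange(pred.size(0), device=pred.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)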
Architecture-Agnostic Masked Image Modeling - From ViT back to CNN
TLDR
It is observed that MIM essentially teaches the model to learn better middle-level interactions among patches and extract more generalized features, and an Architecture-Agnostic Masked Image Modeling framework is proposed, which is compatible with not only Transformers but also CNNs in a unified way.
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
TLDR
A new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) is offered that enjoys both high efficiency and good performance in MIM, and the key is to remove the unnecessary ‘local inter-unit operations’.
Masked Autoencoders are Robust Data Augmentors
TLDR
This paper adopts the self-supervised masked autoencoder to generate distorted views of the input images and shows that utilizing such model-based nonlinear transformations as data augmentation can improve high-level recognition tasks.
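A minimal sketch, assuming a hypothetical pretrained masked autoencoder mae callable as mae(images, mask_ratio=...) (an invented interface, not a real library API): the frozen model's reconstruction of a randomly masked input serves as the distorted view.

import torch

def mae_augment(images, mae, mask_ratio=0.5):
    """images: (N, C, H, W). Returns model-based distorted views of the batch."""
    with torch.no_grad():                                # the augmentor stays frozen
        distorted = mae(images, mask_ratio=mask_ratio)   # assumed interface
    return distorted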
The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training
TLDR
This work presents a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge2AE), for visual pre-training, equipped with geminated decoders in charge of reconstructing image contents from both pixel and frequency space, where each serves as not only a complement to but also a reciprocal constraint on the other.
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
TLDR
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform a hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe.
Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction
TLDR
This work proposes local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7 × 7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image.
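A sketch of the local sampling step under the stated window size, assuming patch embeddings laid out on a 2-D grid; the encoder and reconstruction head are omitted, and the details differ from the LoMaR codebase.

import torch

def sample_local_window(patch_grid, window=7, mask_ratio=0.75):
    """patch_grid: (H, W, D) patch embeddings on a 2-D grid, with H, W >= window."""
    H, W, D = patch_grid.shape
    top = torch.randint(0, H - window + 1, (1,)).item()
    left = torch.randint(0, W - window + 1, (1,)).item()
    local = patch_grid[top:top + window, left:left + window].reshape(-1, D)
    num_mask = int(mask_ratio * local.size(0))
    mask = torch.zeros(local.size(0), dtype=torch.bool)
    mask[torch.randperm(local.size(0))[:num_mask]] = True
    # Only these window * window tokens are fed to the encoder, which is what
    # keeps the reconstruction cheaper than a global one over the whole image.
    return local, mask

Restricting masking and reconstruction to the small window keeps the token sequence short regardless of image resolution, which is the source of the efficiency/accuracy trade-off claimed above.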
Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers
TLDR
This paper designs five other learning objectives that follow the same procedure as MIM but degrade the input image in different ways, and finds the best practice is obtained by keeping the original image style and enriching spatial masking with spatial misalignment.
DILEMMA: Self-Supervised Shape and Texture Learning with Transformers
TLDR
A pseudo-task is proposed to explicitly boost both shape and texture discriminability in models trained via self-supervised learning; it outperforms MoCoV3 and DINO when downstream tasks are strongly reliant on shape, and yields a gain over prior work.
Masked Frequency Modeling for Self-Supervised Visual Pre-Training
TLDR
MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
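A sketch of frequency-domain masking with a simple radial low-/high-pass split, assuming the model sees the filtered image and regresses the removed spectrum; the exact masking strategy and loss in the paper may differ.

import torch

def mask_frequencies(images, radius=16, keep_low=True):
    """images: (N, C, H, W). Returns the filtered image (model input) and the
    removed spectrum (regression target)."""
    N, C, H, W = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H, device=images.device),
                            torch.arange(W, device=images.device), indexing="ij")
    dist = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float().sqrt()
    keep = (dist <= radius) if keep_low else (dist > radius)
    keep = keep.to(images.dtype)
    kept_spectrum = freq * keep
    removed_spectrum = freq * (1 - keep)
    filtered = torch.fft.ifft2(torch.fft.ifftshift(kept_spectrum, dim=(-2, -1))).real
    return filtered, removed_spectrum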

References

Showing 1-10 of 66 references
BEiT: BERT Pre-Training of Image Transformers
TLDR
A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced; experimental results on image classification and semantic segmentation show that the model achieves competitive results with previous pre-training methods.
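A minimal sketch of the BEiT-style objective: mask a subset of patches and classify the discrete visual token at each masked position. tokenizer, encoder, and head are assumed stand-ins (e.g. a dVAE tokenizer, a ViT backbone, and a linear vocabulary head), not concrete APIs.

import torch
import torch.nn.functional as F

def beit_loss(images, patches, tokenizer, encoder, head, mask_ratio=0.4):
    """patches: (N, P, D) patch embeddings; tokenizer(images) -> (N, P) token ids."""
    with torch.no_grad():
        target_tokens = tokenizer(images)                  # discrete visual tokens
    N, P, _ = patches.shape
    mask = torch.rand(N, P, device=patches.device) < mask_ratio
    hidden = encoder(patches, mask)                        # (N, P, H), assumed interface
    logits = head(hidden)                                  # (N, P, vocab_size)
    # Cross-entropy only at the masked positions.
    return F.cross_entropy(logits[mask], target_tokens[mask])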
Masked Autoencoders Are Scalable Vision Learners
TLDR
This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
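A compact sketch of the asymmetric design, assuming encoder, decoder, and a learnable mask_token of the same width as the encoder output; positional embeddings and the encoder-to-decoder projection of the actual method are omitted.

import torch

def mae_forward(patches, encoder, decoder, mask_token, mask_ratio=0.75):
    """patches: (N, P, D); mask_token: (1, 1, H) with H equal to the encoder width."""
    N, P, D = patches.shape
    num_keep = int(P * (1 - mask_ratio))
    order = torch.rand(N, P, device=patches.device).argsort(dim=1)
    keep_idx = order[:, :num_keep]                         # visible positions
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                              # runs on ~25% of the patches
    # Re-insert mask tokens at the masked positions, then decode all P positions.
    full = mask_token.expand(N, P, -1).clone()
    full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
    return decoder(full)                                   # (N, P, D) reconstruction

Because mask tokens never enter the encoder, it only processes roughly (1 - mask_ratio) of the sequence, which is what makes the scheme cheap to scale.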
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
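A toy sketch of the autoregressive objective: flatten quantized pixels in raster order and train with next-token cross-entropy. model is a hypothetical causal Transformer over the pixel vocabulary and quantize a hypothetical color-palette quantizer.

import torch
import torch.nn.functional as F

def pixel_autoregressive_loss(images, model, quantize):
    """images: (N, H, W); quantize maps pixel values to ids in a small palette;
    model is a causal Transformer returning (N, L, vocab) logits."""
    tokens = quantize(images).flatten(1)           # (N, H*W) discrete pixel ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                         # (N, H*W - 1, vocab)
    return F.cross_entropy(logits.transpose(1, 2), targets)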
Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
TLDR
This study shows that denoising autoencoders, such as BEiT or a variant that is introduced in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides new properties to the Vision Transformer (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.
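A compact sketch of self-distillation with no labels in the DINO style: the teacher is an exponential moving average of the student, and the student matches the centered, sharpened teacher outputs; multi-crop training and the running update of the center are omitted.

import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # Teacher weights are an exponential moving average of the student's.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened/centered teacher outputs and student outputs."""
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    student_logprobs = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()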
Masked Feature Prediction for Self-Supervised Visual Pre-Training
TLDR
This work presents Masked Feature Prediction (MaskFeat), which first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions, and finds Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
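A sketch of how HOG targets can be computed with scikit-image; the per-patch pooling of HOG cells and the masking itself follow the paper only loosely.

import numpy as np
from skimage.feature import hog

def hog_targets(gray_image, pixels_per_cell=(8, 8)):
    """gray_image: (H, W) float array. Returns per-cell HOG features to regress."""
    features = hog(
        gray_image,
        orientations=9,
        pixels_per_cell=pixels_per_cell,
        cells_per_block=(1, 1),       # per-cell histograms, no block overlap
        feature_vector=False,
    )
    # Shape: (n_cells_y, n_cells_x, 1, 1, 9) -> drop the singleton block dims.
    return np.squeeze(features, axis=(2, 3))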
An Empirical Study of Training Self-Supervised Vision Transformers
TLDR
This work investigates the effects of several fundamental components for training self-supervised ViT, and reveals that seemingly good results can in fact be partial failures, which improve when training is made more stable.
Pre-Trained Image Processing Transformer
TLDR
To maximally exploit the capability of the transformer, the IPT model is presented, which uses the well-known ImageNet benchmark to generate a large number of corrupted image pairs; contrastive learning is introduced for adapting well to different image processing tasks.
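A sketch of building corrupted/clean training pairs from clean images; the two degradations here (additive Gaussian noise and bicubic downsampling) are chosen for illustration and merely stand in for the task-specific corruptions used in pre-training.

import torch
import torch.nn.functional as F

def make_corrupted_pairs(clean, noise_sigma=0.1, scale=2):
    """clean: (N, C, H, W) in [0, 1]. Returns (noisy, low_res), each paired with `clean`."""
    noisy = (clean + noise_sigma * torch.randn_like(clean)).clamp(0, 1)
    low_res = F.interpolate(clean, scale_factor=1 / scale, mode="bicubic",
                            align_corners=False)
    return noisy, low_res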
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features
TLDR
Patches are cut and pasted among training images, with the ground-truth labels mixed proportionally to the area of the patches; CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task.
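A minimal CutMix sketch: a random box from a shuffled copy of the batch is pasted into the original images and the labels are mixed in proportion to the pasted area.

import torch

def cutmix(images, labels, alpha=1.0):
    """images: (N, C, H, W); labels: (N,) integer class ids."""
    N, _, H, W = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(N)
    # Box size follows the mixing ratio: an area fraction of (1 - lam) is replaced.
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lam from the actual box area so the labels match the pixels.
    lam = 1 - ((y2 - y1) * (x2 - x1)) / (H * W)
    return mixed, labels, labels[perm], lam

The returned pair of label tensors and lam are then combined in the classification loss as lam * CE(pred, labels) + (1 - lam) * CE(pred, labels[perm]).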
Vector-quantized Image Modeling with Improved VQGAN
TLDR
The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning, and beats iGPT-XL, which is trained with extra web image data and a larger model size.
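A tiny sketch of the vector-quantization step at the heart of VQ-style image modeling: each encoder output is snapped to its nearest codebook entry. The straight-through gradient estimator, the commitment loss, and the GAN/perceptual terms of the actual model are omitted.

import torch

def vector_quantize(z, codebook):
    """z: (N, D) encoder outputs; codebook: (K, D). Returns (quantized, indices)."""
    dist = torch.cdist(z, codebook)          # (N, K) pairwise L2 distances
    indices = dist.argmin(dim=1)             # nearest code per vector
    return codebook[indices], indices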