Corpus ID: 245218767

Masked Feature Prediction for Self-Supervised Visual Pre-Training

@article{Wei2021MaskedFP,
  title={Masked Feature Prediction for Self-Supervised Visual Pre-Training},
  author={Chen Wei and Haoqi Fan and Saining Xie and Chaoxia Wu and Alan Loddon Yuille and Christoph Feichtenhofer},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.09133}
}
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good… 
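To make the masked-HOG-target idea above concrete, here is a minimal sketch of how per-patch HOG targets could be computed for randomly masked regions; the patch size, masking ratio, HOG parameters, and the use of scikit-image are illustrative assumptions, not the paper's exact configuration:

import numpy as np
from skimage.feature import hog

def hog_targets_for_masked_patches(image, patch=16, mask_ratio=0.4, rng=None):
    # image: (H, W) grayscale array in [0, 1]; the patch grid is H/patch x W/patch.
    # Returns (mask_indices, targets) where targets[i] is the HOG descriptor
    # of the i-th masked patch, to be regressed by the model.
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    grid_w = w // patch
    num_patches = (h // patch) * grid_w
    num_masked = int(mask_ratio * num_patches)
    mask_indices = rng.choice(num_patches, size=num_masked, replace=False)
    targets = []
    for idx in mask_indices:
        r, c = divmod(int(idx), grid_w)
        region = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        # block_norm applies the local contrast normalization that the
        # abstract above singles out as essential.
        feat = hog(region, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), block_norm='L2')
        targets.append(feat)
    return mask_indices, np.stack(targets)

A full implementation would likely compute HOG over the whole image and then gather per-patch histograms; this per-patch variant is only meant to show the shape of the prediction targets.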
Masked Frequency Modeling for Self-Supervised Visual Pre-Training
TLDR
MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
Object-wise Masked Autoencoders for Fast Pre-training
TLDR
This work introduces a novel object selection and division strategy that drops non-object patches, learning object-wise representations by selective reconstruction with region-of-interest masks, and shows that current masked image encoding models learn the underlying relationship between all objects in the whole scene, instead of a single object representation.
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
TLDR
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform a hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe.
Corrupted Image Modeling for Self-Supervised Visual Pre-Training
TLDR
CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.
Masked Image Modeling with Denoising Contrast
TLDR
This work unleashes the great potential of contrastive learning on denoising auto-encoding and introduces a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction.
Masked Autoencoders for Point Cloud Self-supervised Learning
TLDR
A simple architecture based entirely on standard Transformers can surpass dedicated Transformer models trained with supervised learning, suggesting the feasibility of applying unified architectures from language and images to point clouds.
MVP: Multimodality-guided Visual Pre-training
TLDR
The proposed approach, named Multimodality-guided Visual Pre-training (MVP), replaces the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs, and its effectiveness is demonstrated.
The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training
TLDR
This work presents a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge2AE), for visual pre-training, equipped with geminated decoders that reconstruct image contents from both pixel and frequency space, where each serves as both a complement to and a reciprocal constraint on the other.
Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency
Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures.
Masked Siamese Networks for Label-Efficient Learning
TLDR
This work proposes Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations that improves the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification.
...
...

References

Showing 1-10 of 105 references
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Unsupervised Representation Learning by Predicting Image Rotations
TLDR
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the image they receive as input, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
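As a rough illustration of this rotation-prediction pretext task, the sketch below generates 4-way rotation labels for a batch of images; the function name, NumPy-based batching, and array layout are illustrative assumptions, not the cited paper's code:

import numpy as np

def make_rotation_batch(images, rng=None):
    # images: (N, H, W, C) array. Each image is rotated by a random multiple
    # of 90 degrees; labels[i] in {0, 1, 2, 3} encodes k * 90 degrees and is
    # used as the target of a 4-way classification head.
    rng = rng or np.random.default_rng(0)
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, int(k)) for img, k in zip(images, labels)])
    return rotated, labels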
Masked Autoencoders Are Scalable Vision Learners
TLDR
This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
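The sketch below illustrates the random-masking step implied by this asymmetric design, in which only visible patches are fed to the encoder and an inverse permutation later restores patch order for the decoder; tensor shapes, the 75% mask ratio, and the helper name are illustrative assumptions rather than the paper's implementation:

import torch

def mask_and_restore(patches, mask_ratio=0.75):
    # patches: (B, N, D) patch embeddings. Returns the visible subset plus
    # ids_restore, which lets the decoder unshuffle encoder outputs and
    # appended mask tokens back into the original patch order.
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)        # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore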
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
BEiT: BERT Pre-Training of Image Transformers
TLDR
A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced; experimental results on image classification and semantic segmentation show that it achieves competitive results compared with previous pre-training methods.
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
TLDR
A block-wise masking strategy is proposed in which neighboring video tokens are masked in both spatial and temporal domains, combined with a contrastive objective that captures global content by predicting whether video clips are sampled from the same video.
An Empirical Study of Training Self-Supervised Vision Transformers
TLDR
This work investigates the effects of several fundamental components for training self-supervised ViTs, revealing that apparently good results can hide a partial failure due to training instability, and that they improve when training is made more stable.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
TLDR
This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed, and uses a swapped prediction mechanism that predicts the cluster assignment of a view from the representation of another view.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
...
...