Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang
We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25%∼50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection…
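Observation (i) amounts to feeding the encoder only a random subset of the patch embeddings. A minimal sketch of that sampling step, assuming a `(num_tokens, dim)` embedding array; the function name and interface are hypothetical, not the paper's code:

```python
import numpy as np

def sample_partial_embeddings(embeddings, keep_ratio=0.25, rng=None):
    """Randomly keep a fraction of the patch embeddings (hypothetical helper
    illustrating the partial-observation idea, not the authors' implementation)."""
    rng = rng or np.random.default_rng(0)
    num_tokens = embeddings.shape[0]
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Random subset of token indices, kept in original order
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    return embeddings[keep_idx], keep_idx

# 14x14 = 196 patches at ViT-Base width 768; keep 25% of them
emb = np.random.default_rng(1).standard_normal((196, 768))
kept, idx = sample_partial_embeddings(emb, keep_ratio=0.25)
print(kept.shape)  # (49, 768)
```

The encoder would then run on `kept` alone, which is what makes the partial-observation setting cheaper than processing all tokens.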


ConvMAE: Masked Convolution Meets Masked Autoencoders

This paper demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme, and adopts masked convolution to prevent information leakage in the convolution blocks.

Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Without bells and whistles, imTED improves the state of the art in few-shot object detection by up to 7.6% AP, demonstrating significantly higher generalization capability.

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

This work conducts a comprehensive survey of masked autoencoders to shed insight on a promising direction of SSL, and focuses on its application in vision by discussing its historical developments, recent progress, and implications for diverse applications.

Robust Multi-Object Tracking by Marginal Inference

An efficient approach is presented that computes, in real time, a marginal probability for each pair of objects; this probability is significantly more stable than the original feature distance and can be applied to existing trackers to obtain about a one-point improvement in the IDF1 metric.

Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

This paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline, and builds 3D ViTs that perform surprisingly robustly on popular 3D tasks compared to highly customized 3D-specific designs.

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

It is found that YOLOS pre-trained on only the mid-sized ImageNet-1k dataset can already achieve quite competitive performance on the challenging COCO object detection benchmark, and the impacts as well as limitations of current pre-training schemes and model scaling strategies for Transformers in vision are discussed through YOLOS.

MPViT: Multi-Path Vision Transformer for Dense Prediction

This work explores multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.

Masked Autoencoders Are Scalable Vision Learners

It is shown that masked autoencoders (MAE) are scalable self-supervised learners for computer vision; transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

CoAtNet: Marrying Convolution and Attention for All Data Sizes

This work presents CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights, among them that vertically stacking convolution layers and attention layers in a principled way is surprisingly effective at improving generalization, capacity, and efficiency.

Masked Feature Prediction for Self-Supervised Visual Pre-Training

This work presents Masked Feature Prediction (MaskFeat), which first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions, and finds Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency.
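Since MaskFeat regresses HOG features of the masked regions, a toy version of the descriptor clarifies what the target looks like. This is an assumed simplification with a hypothetical function name; real HOG implementations additionally use cells, blocks, and block normalization:

```python
import numpy as np

def hog_descriptor(patch, num_bins=9):
    """Minimal HOG sketch: one orientation histogram over a whole patch,
    with unsigned gradients binned into num_bins over [0, 180) degrees."""
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # Assign each pixel's gradient magnitude to an orientation bin
    bins = np.minimum((orientation / (180.0 / num_bins)).astype(int),
                      num_bins - 1)
    hist = np.zeros(num_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

patch = np.arange(64, dtype=float).reshape(8, 8)  # simple gradient image
print(hog_descriptor(patch).shape)  # (9,)
```

Because the target is a hand-crafted descriptor rather than raw pixels, the prediction head stays lightweight, which is part of the efficiency claim above.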

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework and achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

DETReg: Unsupervised Pretraining with Region Priors for Object Detection

This work introduces DETReg, a new self-supervised method that pretrains the entire object detection network, including the object localization and embedding components, and shows that it improves over competitive baselines when finetuned on COCO, PASCAL VOC, and Airbus Ship benchmarks.

BEiT: BERT Pre-Training of Image Transformers

A self-supervised vision representation model, BEiT, which stands for Bidirectional Encoder representation from Image Transformers, is introduced, and it is demonstrated that it can learn reasonable semantic regions via pre-training, unleashing the rich supervision signals contained in images.

FCOS: Fully Convolutional One-Stage Object Detection

For the first time, a much simpler and more flexible detection framework achieving improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.

Benchmarking Detection Transfer Learning with Vision Transformers

The results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP by up to 4% (absolute) over supervised and prior self-supervised pre-training methods.