• Corpus ID: 244478047

Benchmarking Detection Transfer Learning with Vision Transformers

Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He, Ross B. Girshick

Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection… 

Exploring Plain Vision Transformer Backbones for Object Detection

This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training, and can compete with the previous leading methods that were all based on hierarchical backbones.

Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP, demonstrating significantly higher generalization capability.

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe.

Vision Transformer Adapter for Dense Predictions

This work proposes a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve comparable performance to vision-specific models by introducing inductive biases via an additional architecture.

How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification

The results show that VTs exhibit strong generalization properties and that these networks are more powerful feature extractors than CNNs.

ConvMAE: Masked Convolution Meets Masked Autoencoders

This paper demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme, and adopts the masked convolution to prevent information leakage in the convolution blocks.

Rethinking Hierarchies in Pre-trained Plain Vision Transformer

This work proposes disentangling the hierarchical architecture design from self-supervised ViT pre-training with minimal changes; the resulting model outperforms the plain ViT baseline in classification, detection, and segmentation on the ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks.

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

This paper designs an effective combination of a token clustering function and a token reconstruction function to maximize the cosine similarity between the reconstructed high-resolution feature maps and the original ones without fine-tuning.

Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly optimizes both the pre-training efficiency and…

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

An improved version of MViT is presented that incorporates decomposed relative positional embeddings and residual pooling connections and outperforms prior work on image and video classification, as well as object detection.



Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

An Empirical Study of Training Self-Supervised Vision Transformers

This work investigates the effects of several fundamental components for training self-supervised ViT, and reveals that seemingly good results are in fact partial failures that can be improved when training is made more stable.

Emerging Properties in Self-Supervised Vision Transformers

This paper asks whether self-supervised learning gives Vision Transformers (ViTs) new properties that stand out compared to convolutional networks (convnets), and presents DINO, a form of self-distillation with no labels, underlining the synergy between DINO and ViTs.

Rethinking ImageNet Pre-Training

Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy, and these discoveries will encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.

Going deeper with Image Transformers

This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization of such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.

Feature Pyramid Networks for Object Detection

This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.

BEiT: BERT Pre-Training of Image Transformers

A self-supervised vision representation model, BEiT, which stands for Bidirectional Encoder representation from Image Transformers, is introduced, and it is demonstrated that it can learn reasonable semantic regions via pre-training, unleashing the rich supervision signals contained in images.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

An improved version of MViT is presented that incorporates decomposed relative positional embeddings and residual pooling connections and outperforms prior work on image and video classification, as well as object detection.