GLiT: Neural Architecture Search for Global and Local Image Transformer

Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, Wanli Ouyang
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition. Recently, transformers without CNN-based backbones are found to achieve impressive performance for image recognition. However, the transformer is designed for NLP tasks and thus could be sub-optimal when directly used for image recognition. In order to improve the visual representation ability for transformers, we propose a new search space and searching algorithm… 

Training-free Transformer Architecture Search

Experimental results demonstrate that TF-TAS achieves competitive performance against state-of-the-art manually or automatically designed ViT architectures, and that it greatly improves search efficiency in the ViT search space.

Less is More: Pay Less Attention in Vision Transformers

A hierarchical Transformer where pure multi-layer perceptrons (MLPs) are used to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers is proposed.

A Survey on Vision Transformer

  • Kai Han, Yunhe Wang, D. Tao
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper reviews these vision transformer models by categorizing them by task and analyzing their advantages and disadvantages, and takes a brief look at the self-attention mechanism in computer vision, as it is the base component of the transformer.

Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

This paper explores the feasibility of finding an optimal sub-model from a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It can search a sub-structure

Pruning Self-attentions into Convolutional Layers in Single Path

A novel weight-sharing scheme between MSA and convolutional operations is proposed, delivering a single-path space to encode all candidate operations and cast the operation search problem as choosing which subset of parameters to use in each MSA layer, which reduces the computational cost and optimization cost.

Vision Transformer with Deformable Attention

A novel deformable self-attention module is proposed, where the positions of key and value pairs in self-attention are selected in a data-dependent way, which enables the self-attention module to focus on relevant regions and capture more informative features.

An Image Patch is a Wave: Phase-Aware Vision MLP

  • Yehui Tang, Kai Han, Yunhe Wang
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.

An Image Patch is a Wave: Quantum Inspired Vision MLP

Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.

TinyViT: Fast Pretraining Distillation for Small Vision Transformers

TinyViT is a new family of tiny and efficient vision transformers pretrained on large-scale datasets with the proposed fast distillation framework, which transfers knowledge from large pretrained models to small ones while enabling small models to get the dividends of massive pretraining data.

Transformers in Vision: A Survey

This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline, with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.



AutoFormer: Searching Transformers for Visual Recognition

This work proposes a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search, which surpasses recent state-of-the-art models such as ViT and DeiT in top-1 accuracy on ImageNet.

Learning Transferable Architectures for Scalable Image Recognition

This paper proposes to search for an architectural building block on a small dataset and then transfer the block to a larger dataset and introduces a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Bottleneck Transformers for Visual Recognition

BoTNet is presented, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation, and a simple adaptation of the BoTNet design for image classification is presented.

Single Path One-Shot Neural Architecture Search with Uniform Sampling

A Single Path One-Shot model is proposed to construct a simplified supernet, where all architectures are single paths, so that the weight co-adaptation problem is alleviated.
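
The uniform single-path sampling named in the title above can be illustrated with a minimal sketch. The search space, operation names, and the `sample_single_path` helper are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

# Hypothetical search space (an assumption for illustration):
# each layer of the supernet offers a few candidate operations.
SEARCH_SPACE = [
    ["conv3x3", "conv5x5", "identity"],
    ["conv3x3", "conv5x5", "identity"],
    ["conv3x3", "conv5x5", "identity"],
]


def sample_single_path(search_space, rng):
    """Pick exactly one candidate per layer, uniformly at random."""
    return [rng.choice(candidates) for candidates in search_space]


rng = random.Random(0)  # seeded so repeated runs give the same path
path = sample_single_path(SEARCH_SPACE, rng)
print(path)
```

In a supernet training loop built on this idea, one such path would be drawn per step and only the weights along that path would be updated.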

Pre-Trained Image Processing Transformer

To make full use of the capability of the transformer, the IPT model is presented, which uses the well-known ImageNet benchmark to generate a large number of corrupted image pairs; contrastive learning is introduced so that the model adapts well to different image processing tasks.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
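
As a rough illustration of the residual formulation, a block computes y = F(x) + x, so the layers only need to learn the residual F. The sketch below uses plain Python lists, and `residual_block` and `zero_residual` are hypothetical names introduced here, not code from the paper.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    # The shortcut connection adds the input back element-wise.
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    """A residual branch that outputs all zeros (toy stand-in for F)."""
    return [0.0 for _ in x]


# With a zero residual the block reduces to the identity mapping.
out = residual_block([1.0, 2.0, 3.0], zero_residual)
print(out)
```

When the residual branch outputs zeros, the block passes its input through unchanged, which gives some intuition for why added residual blocks need not make optimization harder.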

Regularized Evolution for Image Classifier Architecture Search

This work evolves an image classifier, AmoebaNet-A, that surpasses hand-designs for the first time and gives evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search.
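
A minimal sketch of regularized (aging) evolution follows: tournament selection picks a parent, a mutated child joins the population, and the oldest member is removed rather than the worst. The bit-list architecture encoding, the toy fitness, and all function names are assumptions made for illustration; a real search would train and evaluate each candidate network instead.

```python
import collections
import random


def regularized_evolution(fitness, random_arch, mutate,
                          cycles, population_size, sample_size, rng):
    """Aging evolution: the oldest individual dies at every step."""
    population = collections.deque()
    history = []
    # Seed the population with random architectures.
    while len(population) < population_size:
        arch = random_arch(rng)
        individual = (arch, fitness(arch))
        population.append(individual)
        history.append(individual)
    # Evolve until `cycles` individuals have been evaluated in total.
    while len(history) < cycles:
        sample = [rng.choice(population) for _ in range(sample_size)]
        parent = max(sample, key=lambda item: item[1])
        child_arch = mutate(parent[0], rng)
        child = (child_arch, fitness(child_arch))
        population.append(child)
        history.append(child)
        population.popleft()  # remove the oldest, not the worst
    return max(history, key=lambda item: item[1])


# Toy stand-ins (assumptions): an architecture is a list of 8 bits and
# its fitness is the number of ones, in place of validation accuracy.
def random_arch(rng):
    return [rng.randint(0, 1) for _ in range(8)]


def mutate(arch, rng):
    child = list(arch)
    position = rng.randrange(len(child))
    child[position] = 1 - child[position]  # flip one random bit
    return child


best_arch, best_fitness = regularized_evolution(
    fitness=sum, random_arch=random_arch, mutate=mutate,
    cycles=200, population_size=20, sample_size=5, rng=random.Random(0))
print(best_arch, best_fitness)
```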

GradNet: Gradient-Guided Network for Visual Object Tracking

A novel gradient-guided network is proposed to exploit the discriminative information in gradients and update the template in the siamese network through feed-forward and backward operations, and a template generalization training method is proposed to better use gradient information and avoid overfitting.