SP-ViT: Learning 2D Spatial Priors for Vision Transformers

Yuxuan Zhou, Wangmeng Xiang, C. Li, Biao Wang, Xihan Wei, Lei Zhang, Margret Keuper, Xia Hua
Recently, transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. However, compared to CNNs, transformers converge slowly and are prone to overfitting in low-data regimes due to the lack of spatial inductive biases. Such spatial inductive biases can be especially beneficial since the 2D structure of an input image is not well preserved in transformers. In this work, we present Spatial Prior–enhanced Self-Attention (SP… 
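As context for the abstract above: one common way to inject a 2D spatial inductive bias into self-attention is to add a learned bias, indexed by each token pair's 2D relative offset, to the attention logits. The sketch below shows this generic technique with NumPy; it is an illustrative assumption, not the paper's exact SP-SA formulation, and all names (`attention_with_2d_prior`, `bias_table`) are hypothetical.

```python
import numpy as np

def attention_with_2d_prior(x, Wq, Wk, Wv, bias_table, H, W):
    """Single-head self-attention over an H*W grid of tokens, with a
    learned 2D relative-position bias added to the attention logits.

    NOTE: a generic sketch of a learned spatial prior (hypothetical
    helper), not the SP-SA formulation from the paper itself.
    """
    N, d = x.shape                         # N = H*W tokens, d = embed dim
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(d)          # standard scaled dot-product

    # 2D coordinates of each token on the grid.
    ys, xs = np.divmod(np.arange(N), W)
    # Pairwise relative offsets, shifted to be non-negative table indices.
    dy = ys[:, None] - ys[None, :] + (H - 1)
    dx = xs[:, None] - xs[None, :] + (W - 1)
    # bias_table has shape (2H-1, 2W-1): one learned scalar per offset.
    logits = logits + bias_table[dy, dx]   # the learned spatial prior

    # Numerically stable softmax over keys.
    a = np.exp(logits - logits.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v
```

Because the bias depends only on relative offsets, it is shared across all positions on the grid, which is what makes it a translation-consistent spatial prior rather than an absolute positional encoding.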


Hypergraph Transformer for Skeleton-based Action Recognition

A new self-attention (SA) extension, named Hypergraph Self-Attention (HyperSA), is proposed to incorporate inherently higher-order relations into the model; the authors also find that intra-joint modeling of simple 3D coordinates contributes little, and therefore propose removing the MLP layer.

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

3D geometric information is introduced into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step, achieving state-of-the-art performance on the TextVQA and ST-VQA datasets.

DeepViT: Towards Deeper Vision Transformer

This paper proposes a simple yet effective method, named Re-attention, to regenerate the attention maps and increase their diversity across layers at negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.

CvT: Introducing Convolutions to Vision Transformers

A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs; the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.

CoAtNet: Marrying Convolution and Attention for All Data Sizes

CoAtNets (pronounced “coat” nets) are a family of hybrid models built on two key insights: depthwise convolution and self-attention can be naturally unified via simple relative attention, and vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency.

Going deeper with Image Transformers

This work builds and optimizes deeper transformer networks for image classification, investigates the interplay between the architecture and optimization of such dedicated transformers, and makes two architecture changes that significantly improve the accuracy of deep transformers.

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

This paper proposes a dual-branch transformer to combine image patches of different sizes to produce stronger image features, and develops a simple yet effective token fusion module based on cross attention, which uses a single token from each branch as a query to exchange information with the other branch.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

A new Tokens-To-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure motivated by CNN architecture design and empirical study, and reduces the parameter count and MACs of vanilla ViT by half.

Incorporating Convolution Designs into Visual Transformers

A new Convolution-enhanced image Transformer (CeiT) is proposed, which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

This work designs a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime and proposes LeViT, a hybrid neural network for fast inference image classification that significantly outperforms existing convnets and vision transformers.