Corpus ID: 235358331

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

@inproceedings{Xu2021ViTAEVT,
  title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
  author={Yufei Xu and Qiming Zhang and Jing Zhang and Dacheng Tao},
  booktitle={NeurIPS},
  year={2021}
}
Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Consequently, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we… 
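
To make the abstract's premise concrete, the sketch below shows how a plain ViT-style tokenizer flattens the 2D image grid into the 1D token sequence the abstract refers to, which is where explicit locality and scale structure is lost. It is a generic, illustrative PyTorch snippet with made-up hyperparameters (patch size 16, embedding dim 768), not code from ViTAE.

# Minimal sketch (not the authors' code): a plain ViT-style tokenizer flattens an
# image into a 1D token sequence, discarding explicit 2D locality/scale structure.
import torch
import torch.nn as nn

class NaivePatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting non-overlapping patches
        # and applying a shared linear projection to each of them.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, D): 2D grid -> 1D token sequence

tokens = NaivePatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
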
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
TLDR
A Vision Transformer Advanced by Exploring intrinsic IB from convolutions (ViTAE) that has the intrinsic locality IB and learns local features and global dependencies collaboratively; the model is scaled up to 644M parameters and obtains state-of-the-art classification performance.
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
TLDR
Hinging on the cross-scale attention module, a versatile vision architecture dubbed CrossFormer is constructed, which accommodates variable-sized inputs and puts forward a dynamic position bias that makes the popular relative position bias applicable to variable-sized images.
Pruning Self-attentions into Convolutional Layers in Single Path
TLDR
A novel weight-sharing scheme between MSA and convolutional operations is proposed, delivering a single-path space that encodes all candidate operations and casting the operation search problem as finding which subset of parameters to use in each MSA layer, which significantly reduces the computational cost and optimization difficulty.
MPViT: Multi-Path Vision Transformer for Dense Prediction
TLDR
This work explores multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
Adaptive Split-Fusion Transformer
TLDR
A new hybrid named Adaptive Split-Fusion Transformer (ASF-former) is proposed to treat convolutional and attention branches differently with adaptive weights; it outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy.
ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
TLDR
Experimental results show that the basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state of the art.
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
TLDR
This work proposes an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers, and proposes a boundary-preserving intra-divergence loss based on DeepInversion to close the performance gap against the full-data counterpart.
Inception Transformer
TLDR
A novel and general-purpose Inception Transformer is presented that effectively learns comprehensive features with both high- and low-frequency information in visual data and achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints
TLDR
Three types of structures for Q, K, and V embedding are proposed, and experiments demonstrate the superior image classification performance of the proposed approaches compared to several state-of-the-art methods.
Ripple Attention for Visual Perception with Sub-quadratic Complexity
TLDR
A novel dynamic programming algorithm is designed that weights the contributions of different tokens to a query with respect to their relative spatial distances in the 2D space in linear observed time; experiments demonstrate the effectiveness of ripple attention on various visual tasks.
...
...

References

Showing 1-10 of 108 references
So-ViT: Mind Visual Tokens for Vision Transformer
TLDR
This paper proposes a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification, and develops a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
TLDR
GPSA is introduced, a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias and outperforms DeiT on ImageNet, while offering much improved sample efficiency.
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
TLDR
This paper proposes a dual-branch transformer to combine image patches of different sizes to produce stronger image features and develops a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
XCiT: Cross-Covariance Image Transformers
TLDR
This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries, which has linear complexity in the number of tokens, and allows high-resolution images processing.
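
As a rough illustration of the "transposed" attention summarized above, the following single-head sketch computes attention over the D x D cross-covariance of L2-normalized queries and keys, so the cost grows linearly with the number of tokens. It is an assumption-laden simplification (no multi-head split, no learned per-head temperature), not the official XCiT implementation.

# Hedged sketch of cross-covariance ("transposed") attention: the attention map
# is D x D over feature channels, so cost is linear in the number of tokens N.
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v: (B, N, D) token embeddings
    q = F.normalize(q, dim=1)                        # L2-normalize along the token dimension
    k = F.normalize(k, dim=1)
    attn = (q.transpose(-2, -1) @ k) * temperature   # (B, D, D): channel-to-channel interactions
    attn = attn.softmax(dim=-1)
    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # (B, N, D)

out = cross_covariance_attention(*(torch.randn(2, 196, 64) for _ in range(3)))
print(out.shape)  # torch.Size([2, 196, 64])
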
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
TLDR
The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Ze Liu, Yutong Lin, B. Guo
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021
TLDR
A hierarchical Transformer whose representation is computed with shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size; the hierarchical design and shifted-window approach also prove beneficial for all-MLP architectures.
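
The linear complexity noted above comes from restricting self-attention to fixed-size local windows; the sketch below shows only the window-partition step, with an optional cyclic shift standing in for the shifted-window idea. It is an illustrative fragment (the attention itself, masking, and the reverse shift are omitted), not the Swin Transformer reference code.

# Hedged sketch: group tokens into M x M windows (optionally after a cyclic
# shift) so self-attention runs independently inside each window.
import torch

def window_partition(x, window_size, shift=0):
    # x: (B, H, W, C) feature map laid out on its 2D grid
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))  # cyclic shift
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows  # (num_windows * B, window_size**2, C): attention runs per window

windows = window_partition(torch.randn(1, 56, 56, 96), window_size=7, shift=3)
print(windows.shape)  # torch.Size([64, 49, 96])
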
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
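
The distillation-token idea mentioned above can be pictured as follows: a second learnable token travels through the same attention blocks as the class token and is supervised by the teacher's hard predictions, while the class token is supervised by the ground-truth labels. The sketch below is a simplified rendering with invented names (DistilledTokenHead) and an assumed equal loss weighting; it is not the DeiT reference code.

# Hedged sketch of a distillation token: prepend it next to the class token,
# give it its own head, and train that head against the teacher's hard labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledTokenHead(nn.Module):
    def __init__(self, embed_dim=192, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)        # supervised by ground truth
        self.head_dist = nn.Linear(embed_dim, num_classes)   # supervised by the teacher

    def prepend_tokens(self, patch_tokens):                  # patch_tokens: (B, N, D)
        B = patch_tokens.shape[0]
        return torch.cat([self.cls_token.expand(B, -1, -1),
                          self.dist_token.expand(B, -1, -1),
                          patch_tokens], dim=1)              # (B, N + 2, D)

    def losses(self, encoded, labels, teacher_logits):       # encoded: (B, N + 2, D)
        logits_cls = self.head(encoded[:, 0])                # class-token prediction
        logits_dist = self.head_dist(encoded[:, 1])          # distillation-token prediction
        loss_true = F.cross_entropy(logits_cls, labels)
        loss_teacher = F.cross_entropy(logits_dist, teacher_logits.argmax(dim=-1))
        return 0.5 * loss_true + 0.5 * loss_teacher
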
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TLDR
A new Tokens-To-Token Vision Transformer (T2T-VTT), which incorporates an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study and reduces the parameter count and MACs of vanilla ViT by half.
Incorporating Convolution Designs into Visual Transformers
TLDR
A new Convolution-enhanced image Transformer (CeiT) is proposed which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
...
...