• Corpus ID: 235358331

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

@inproceedings{Xu2021ViTAEVT,
  title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
  author={Yufei Xu and Qiming Zhang and Jing Zhang and Dacheng Tao},
  booktitle={NeurIPS},
  year={2021}
}
Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependency using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Instead, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we… 
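The abstract is truncated here, but the central idea, injecting the locality inductive bias of convolutions into a transformer block alongside global self-attention, can be pictured with a minimal, hypothetical sketch. The module and parameter names below are illustrative and are not the authors' ViTAE implementation.

import torch
import torch.nn as nn

class LocalityEnhancedBlock(nn.Module):
    """Illustrative transformer block with a parallel convolution branch.

    A sketch of the general idea (convolutional locality bias fused with
    global self-attention), not the authors' ViTAE code.
    """

    def __init__(self, dim: int, num_heads: int = 8, grid: int = 14):
        super().__init__()
        self.grid = grid  # tokens per side, so num_tokens == grid * grid
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise 3x3 convolution supplies the local inductive bias.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        b, n, d = x.shape
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)  # global dependencies
        conv_in = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        conv_out = self.local(conv_in).flatten(2).transpose(1, 2)  # local features
        return x + attn_out + conv_out  # fuse both branches residually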
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
TLDR
A Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE, which has an intrinsic locality IB, learns local features and global dependencies collaboratively, and can be scaled up to 644M parameters to obtain state-of-the-art classification performance.
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
TLDR
Hinging on the cross-scale attention module, a versatile vision architecture is constructed, dubbed CrossFormer, which accommodates variable-sized inputs and puts forward a dynamic position bias for vision transformers to make the popular relative position bias applicable to variable-sized images.
Pruning Self-attentions into Convolutional Layers in Single Path
TLDR
A novel weight-sharing scheme between MSA and convolutional operations is proposed, delivering a single-path space that encodes all candidate operations and casting the operation search problem as finding which subset of parameters to use in each MSA layer, which significantly reduces the computational cost and optimization difficulty.
MPViT: Multi-Path Vision Transformer for Dense Prediction
TLDR
This work explores multi-scale patch embedding and a multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
Adaptive Split-Fusion Transformer
TLDR
A new hybrid named Adaptive Split-Fusion Transformer (ASF-former) is proposed to treat convolutional and attention branches differently with adaptive weights; it outperforms its CNN and transformer counterparts, as well as earlier hybrid designs, in terms of accuracy.
Ripple Attention for Visual Perception with Sub-quadratic Complexity
TLDR
This work proposes ripple attention, a sub-quadratic attention mechanism for visual perception that derives the spatial weights through a stick-breaking transformation, and designs a dynamic programming algorithm that computes weighted contributions for all queries in linear observed time.
Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints
TLDR
Three types of structures for Q, K, and V embedding are proposed, demonstrating superior image classification performance in experiments compared to several state-of-the-art approaches.
TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation
TLDR
TransCAM is a Conformer-based solution to WSSS that explicitly leverages the attention weights from the transformer branch of the Conformer to refine the CAM generated from the CNN branch, achieving new state-of-the-art performance of 69.3% and 69.6% on the PASCAL VOC 2012 validation and test sets, respectively.
Gauge Equivariant Transformer
TLDR
This work is the first to introduce gauge equivariance to self-attention; the resulting model, which can be efficiently implemented on triangle meshes, is named the Gauge Equivariant Transformer (GET) and achieves state-of-the-art performance on two common recognition tasks.
GMFlow: Learning Optical Flow via Global Matching
TLDR
A GMFlow framework is proposed, consisting of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation; it outperforms RAFT with 31 refinements on the challenging Sintel benchmark while using only one refinement and running faster.

References

Showing 1-10 of 107 references
So-ViT: Mind Visual Tokens for Vision Transformer
TLDR
This paper proposes a new classification paradigm, where second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification, and develops a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
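The second-order pooling mentioned above can be illustrated with a minimal sketch under assumed shapes; the helper name is hypothetical and this is not the So-ViT code. The covariance of the visual tokens yields a global second-order feature that is then combined with the class token for the final prediction.

import torch

def second_order_pool(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim) visual tokens, class token excluded
    mean = tokens.mean(dim=1, keepdim=True)
    centered = tokens - mean
    # (batch, dim, dim) cross-covariance of token features
    cov = centered.transpose(1, 2) @ centered / tokens.shape[1]
    return cov.flatten(1)  # flattened second-order statistic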
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
TLDR
This paper proposes a dual-branch transformer to combine image patches of different sizes to produce stronger image features and develops a simple yet effective token fusion module based on cross attention which uses a single token for each branch as a query to exchange information with other branches.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
XCiT: Cross-Covariance Image Transformers
TLDR
This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries, and has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
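A minimal sketch of such channel-wise ("transposed") attention, under assumed tensor shapes; it is illustrative and not the released XCiT code. The attention map is dim x dim over feature channels, so the cost grows linearly with the number of tokens.

import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v: (batch, heads, tokens, dim_per_head)
    # L2-normalise along the token axis so the cross-covariance is bounded.
    q = F.normalize(q, dim=-2)
    k = F.normalize(k, dim=-2)
    # (batch, heads, dim, dim) channel-to-channel attention map.
    attn = ((q.transpose(-2, -1) @ k) * temperature).softmax(dim=-1)
    # Mix the channels of the values; output keeps shape (batch, heads, tokens, dim).
    return v @ attn.transpose(-2, -1)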
Incorporating Convolution Designs into Visual Transformers
TLDR
A new Convolution-enhanced image Transformer (CeiT) is proposed which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Ze Liu, Yutong Lin, B. Guo
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
TLDR
A hierarchical Transformer whose representation is computed with Shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size and will prove beneficial for all-MLP architectures.
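The window bookkeeping behind the shifted-window scheme can be sketched as follows; the helper names are hypothetical and this is not the released Swin implementation. Attention is computed inside fixed-size windows, and alternating blocks cyclically shift the feature map by half a window so that the new windows straddle the previous block's window boundaries.

import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # x: (batch, height, width, channels); height and width are assumed to be
    # divisible by window_size.
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    # (num_windows * batch, window_size * window_size, channels)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)

def shift_then_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # Cyclically shift by half a window before partitioning, as in the
    # shifted-window blocks; self-attention is then run within each window.
    shift = window_size // 2
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(shifted, window_size)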
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
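A rough sketch of hard-label distillation through a dedicated token, assuming the student returns separate class-token and distillation-token logits; the names are illustrative, not the DeiT API.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels):
    # Hard teacher targets: the student's distillation head is trained to
    # match them, while its class-token head is trained on the true labels.
    with torch.no_grad():
        teacher_labels = teacher(images).argmax(dim=-1)
    cls_logits, dist_logits = student(images)  # assumed two-headed output
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_labels)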
Rethinking Spatial Dimensions of Vision Transformers
TLDR
A novel Pooling-based Vision Transformer (PiT) is proposed, which achieves improved model capability and generalization performance compared to ViT and outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation.
Co-Scale Conv-Attentional Image Transformers
TLDR
Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and convolution-like mechanisms, empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
ConTNet: Why not use convolution and transformer at the same time?
TLDR
This work proposes ConTNet (ConvolutionTransformer Network), combining transformer with ConvNet architectures to provide large receptive fields, serving as a useful backbone for CV tasks and bringing new ideas to model design.