Corpus ID: 235376986

CoAtNet: Marrying Convolution and Attention for All Data Sizes

@inproceedings{Dai2021CoAtNetMC,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Zihang Dai and Hanxiao Liu and Quoc V. Le and Mingxing Tan},
  booktitle={NeurIPS},
  year={2021}
}
Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, we present CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1…
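As a rough illustration of the hybrid idea sketched in the abstract, the snippet below stacks a depthwise-convolution block and a global self-attention block in PyTorch. It is a minimal sketch only: the module names, block order, and hyperparameters are illustrative assumptions, not the authors' exact CoAtNet design.

```python
# Minimal sketch (assumption, not the official CoAtNet code): a convolutional
# block for early, high-resolution stages followed by a self-attention block
# for later stages.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Inverted-bottleneck depthwise conv block with a residual connection."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1),   # expand channels
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise 3x3
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),   # project back
        )

    def forward(self, x):  # x: (B, C, H, W)
        return x + self.block(x)


class AttentionBlock(nn.Module):
    """Global multi-head self-attention over the flattened feature map."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)
        tokens = tokens + y                     # residual
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    stage = nn.Sequential(ConvBlock(64), AttentionBlock(64))
    print(stage(x).shape)  # torch.Size([2, 64, 32, 32])
```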
Patches Are All You Need?
TLDR
The ConvMixer is proposed, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
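To make the ConvMixer description above concrete, here is a minimal PyTorch sketch of the idea (patch embedding, then repeated depthwise spatial mixing and pointwise channel mixing at constant resolution). The specific widths, depth, and kernel sizes are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal ConvMixer-style sketch (an approximation of the design described above).
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a module with a skip connection."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    # Patch embedding: a strided convolution turns the image into patch tokens.
    layers = [nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
              nn.GELU(), nn.BatchNorm2d(dim)]
    for _ in range(depth):
        layers += [
            # Spatial mixing: residual depthwise convolution.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(), nn.BatchNorm2d(dim))),
            # Channel mixing: pointwise (1x1) convolution.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim),
        ]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, n_classes)]
    return nn.Sequential(*layers)


if __name__ == "__main__":
    model = conv_mixer(dim=128, depth=4)
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```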
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
TLDR
This work proposes a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format, and adopts it for various vision tasks, from the image to the video domain and from classification to dense prediction.
ViT-P: Rethinking Data-efficient Vision Transformers from Locality
TLDR
This work constrains the self-attention of ViT to have a multi-scale localized receptive field so that an optimal configuration can be learned, and provides empirical evidence that properly constraining the receptive field can reduce the amount of training data needed for vision transformers.
EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers
TLDR
This work proposes EdgeFormer, a pure ConvNet-based backbone model that strengthens the advantages of light-weight ConvNets by fusing in the merits of vision transformers, and proposes global circular convolution (GCC) with position embeddings, a light-weight convolution op that boasts a global receptive field while producing location-sensitive features as in local convolutions.
A Survey of Visual Transformers
TLDR
This survey comprehensively reviews over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, including the deformable attention module, which combines the sparse spatial sampling of deformable convolution with the relation-modeling capability of Transformers.
Learned Queries for Efficient Local Attention
TLDR
A new shift-invariant local attention layer, called query-and-attend (QnA), is proposed that aggregates the input locally in an overlapping manner, much like convolutions, and shows improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models.
MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video
TLDR
A novel MorphMLP architecture is proposed that focuses on capturing local details in the low-level layers while gradually shifting to long-term modeling in the high-level layers; it can be as powerful as, and even outperform, self-attention-based models.
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
TLDR
This study trains ViT models of various sizes on the public ImageNet-21k dataset that either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.
Video Transformers: A Survey
TLDR
This survey analyses and summarizes the main contributions and trends in adapting Transformers to model video data, and explores how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens.
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
TLDR
A Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE, is proposed, which has an intrinsic locality IB and is able to learn local features and global dependencies collaboratively; the model is scaled up to 644M parameters and obtains state-of-the-art classification performance.

References

Showing 1-10 of 51 references
DeepViT: Towards Deeper Vision Transformer
TLDR
This paper proposes a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity at different layers with negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs; the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
Attention Augmented Convolutional Networks
TLDR
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
Incorporating Convolution Designs into Visual Transformers
TLDR
A new Convolution-enhanced image Transformer (CeiT) is proposed which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
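A minimal PyTorch sketch of the SE mechanism just described, assuming the commonly used squeeze (global average pooling), bottleneck MLP, and sigmoid gating; the reduction ratio and layer choices are assumptions for illustration, not the exact SENet code.

```python
# Minimal Squeeze-and-Excitation block sketch.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back to C gates
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                # global average pooling: (B, C)
        gates = self.fc(s).view(b, c, 1, 1)   # per-channel gates in [0, 1]
        return x * gates                      # channel-wise recalibration


if __name__ == "__main__":
    block = SEBlock(64)
    print(block(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```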
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
High-Performance Large-Scale Image Recognition Without Normalization
TLDR
An adaptive gradient clipping technique is developed which overcomes the instabilities of training without batch normalization, and a significantly improved class of Normalizer-Free ResNets is designed which attain significantly better performance when fine-tuning on ImageNet.
Scaling Vision Transformers
TLDR
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization in such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.