Corpus ID: 235367962

Scaling Vision Transformers

@article{Zhai2021ScalingVT,
  title={Scaling Vision Transformers},
  author={Xiaohua Zhai and Alexander Kolesnikov and Neil Houlsby and Lucas Beyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.04560}
}
Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model’s scaling properties is key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up…
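
For context on what a scaling-law study involves, here is a minimal sketch of fitting a saturating power law, error(C) = a·C^(-b) + c, to model-quality measurements; this functional form is a common choice in the scaling-law literature, and all numbers and shapes below are illustrative assumptions rather than values from this paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating power law often used in scaling-law studies:
# error(C) = a * C**(-b) + c, with an irreducible error floor c.
def saturating_power_law(compute, a, b, c):
    return a * compute ** (-b) + c

# Hypothetical (compute, validation error) measurements.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
error = np.array([0.40, 0.31, 0.25, 0.21, 0.19])

# Normalize compute so the optimizer is well conditioned.
x = compute / compute[0]
(a, b, c), _ = curve_fit(saturating_power_law, x, error, p0=(0.3, 0.3, 0.1))
print(f"fit: error ~ {a:.2f} * C_norm^(-{b:.2f}) + {c:.2f}")

# Extrapolate one decade beyond the measured range.
print("predicted error at 1e23 FLOPs:",
      saturating_power_law(1e23 / compute[0], a, b, c))
```
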
Citations

A Survey on Vision Transformer
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation…
VOLO: Vision Outlooker for Visual Recognition
A novel outlook attention is introduced, and a model built on it, termed Vision Outlooker (VOLO), is presented; the outlook attention efficiently encodes finer-level features and contexts into tokens, which is shown to be critically beneficial to recognition performance but largely ignored by self-attention.
Scaling Vision with Sparse Mixture of Experts
This work presents a Vision MoE (V-MoE), a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.
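
As a rough illustration of the sparse expert routing this summary describes, here is a minimal numpy sketch of top-k gating: each token is sent to its k highest-scoring experts and their outputs are combined by the gate weights. The shapes, the dense loop, and the omission of capacity limits and batch-prioritized ordering are all simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, k = 8, 16, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
w_gate = rng.normal(size=(d_model, num_experts))             # router weights
experts = rng.normal(size=(num_experts, d_model, d_model))   # one stand-in matrix per expert

# Router: softmax over experts, keep the top-k per token.
logits = tokens @ w_gate
gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
gates /= gates.sum(axis=-1, keepdims=True)
topk = np.argsort(-gates, axis=-1)[:, :k]                    # (num_tokens, k)

# Combine expert outputs, weighted by renormalized gate values.
# (V-MoE's batch-prioritized routing would additionally sort tokens by
# gate weight before enforcing expert capacity; omitted here.)
out = np.zeros_like(tokens)
for t in range(num_tokens):
    weights = gates[t, topk[t]]
    weights /= weights.sum()
    for e, w in zip(topk[t], weights):
        out[t] += w * (tokens[t] @ experts[e])
print(out.shape)  # (8, 16)
```
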
CoAtNet: Marrying Convolution and Attention for All Data Sizes
CoAtNets (pronounced “coat” nets) are a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention, and (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency.
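
The first insight, unifying depthwise convolution and self-attention via relative attention, amounts to adding a learned position-dependent bias (the convolution-like part) to content-based attention logits. A one-dimensional, single-head numpy sketch, with shapes and the scalar-bias parameterization chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))
v = rng.normal(size=(seq_len, d))

# One learned scalar bias per relative offset in [-(L-1), L-1];
# this is the convolution-like, input-independent term.
rel_bias = rng.normal(size=(2 * seq_len - 1,))
offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
bias = rel_bias[offsets + seq_len - 1]                    # (seq_len, seq_len)

# Relative attention: content logits plus position-dependent bias.
logits = q @ k.T / np.sqrt(d) + bias
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v
print(out.shape)  # (6, 8)
```
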
BEiT: BERT Pre-Training of Image Transformers
A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced. Experimental results on image classification and semantic segmentation show that the model achieves competitive results with previous pre-training methods.
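
A toy sketch of a masked image modeling objective in the spirit of BEiT: replace a subset of patch embeddings with a mask embedding and score predictions of discrete visual-token ids only at masked positions. The tokenizer targets, the head, and the skipped Transformer encoder are stand-ins, not BEiT's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d, vocab = 16, 32, 128

patches = rng.normal(size=(num_patches, d))       # patch embeddings
token_ids = rng.integers(0, vocab, num_patches)   # stand-in "visual tokens"
mask_embed = rng.normal(size=(d,))                # learned [MASK] embedding
head = rng.normal(size=(d, vocab)) * 0.02         # stand-in prediction head

# Mask ~40% of patches and replace them with the mask embedding.
masked = rng.random(num_patches) < 0.4
inputs = np.where(masked[:, None], mask_embed, patches)

# (A real model runs a Transformer encoder here; we go straight to
# the head to keep the sketch short.)
logits = inputs @ head
logits = logits - logits.max(axis=-1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Cross-entropy on masked positions only, as in masked image modeling.
loss = -log_probs[masked, token_ids[masked]].mean()
print("masked-patch loss:", loss)
```
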
Go Wider Instead of Deeper
This paper proposes a framework to deploy trainable parameters efficiently by going wider instead of deeper, transforming various semantic representations in a way that makes the model more parameter-efficient and effective.
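
A minimal sketch of the "wider instead of deeper" parameter-sharing idea: one set of block weights is reused at every depth, so the parameter count grows with width rather than depth. The single shared feed-forward block and the per-depth layer norms here are assumptions modeled on this line of work, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, tokens = 16, 4, 8

x = rng.normal(size=(tokens, d))
w1 = rng.normal(size=(d, 4 * d)) * 0.1   # ONE shared FFN, reused at every depth
w2 = rng.normal(size=(4 * d, d)) * 0.1
gains = np.ones((depth, d))              # separate layer-norm params per depth
biases = np.zeros((depth, d))

def layer_norm(h, gain, bias, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gain * (h - mu) / np.sqrt(var + eps) + bias

for i in range(depth):
    h = layer_norm(x, gains[i], biases[i])
    x = x + np.maximum(h @ w1, 0.0) @ w2   # residual + shared ReLU FFN
print(x.shape)  # parameters grow with width, not with depth
```
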
Towards Efficient and Data Agnostic Image Classification Training Pipeline for Embedded Systems
Nowadays, deep learning-based methods have achieved remarkable progress on the image classification task across a wide range of commonly used datasets (ImageNet, CIFAR, SVHN, Caltech 101, SUN397, …
Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model
A series of transformer inference optimization techniques at both the algorithm level and the hardware level are introduced, including a pre-padding decoding mechanism that improves token parallelism for text generation, and highly optimized kernels designed for very long input lengths and large hidden sizes.
Pruning vs XNOR-Net: A Comprehensive Study of Deep Learning for Audio Classification on Edge-devices
Deep Learning has celebrated resounding successes in many application areas of relevance to the Internet-of-Things, for example, computer vision and machine listening. To fully harness the power of…
Robust fine-tuning of zero-shot models
Large pre-trained models such as CLIP offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although…

References

Showing 1-10 of 40 references
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
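
The distillation token attaches a second learnable token whose head is trained against the teacher's prediction, while the class token's head is trained against the ground-truth label. A sketch of the combined loss in the hard-label variant, with stand-in logits in place of actual ViT outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10

def cross_entropy(logits, target):
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

# Outputs of the two heads: one reads the class token, one the
# distillation token (stand-in values; in DeiT they come from the ViT).
cls_logits = rng.normal(size=num_classes)
dist_logits = rng.normal(size=num_classes)
teacher_logits = rng.normal(size=num_classes)

true_label = 3
teacher_label = int(teacher_logits.argmax())   # "hard" teacher target

# Class head learns from ground truth, distillation head from the teacher.
loss = 0.5 * cross_entropy(cls_logits, true_label) \
     + 0.5 * cross_entropy(dist_logits, teacher_label)
print("combined loss:", loss)
```
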
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient; the method's effectiveness is demonstrated by scaling up MobileNets and ResNet.
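
The compound coefficient φ scales depth, width, and resolution jointly as α^φ, β^φ, γ^φ with α·β²·γ² ≈ 2, so each increment of φ roughly doubles FLOPs. The constants below are the ones reported in the EfficientNet paper; the baseline depth and resolution are illustrative:

```python
# Compound scaling: depth = alpha**phi, width = beta**phi,
# resolution = gamma**phi, chosen so total FLOPs grow ~2**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15   # constants from the EfficientNet paper

base_depth, base_width, base_res = 18, 1.0, 224  # illustrative B0-like baseline

for phi in range(5):
    depth = round(base_depth * alpha ** phi)
    width = base_width * beta ** phi
    res = round(base_res * gamma ** phi)
    flops_factor = (alpha * beta ** 2 * gamma ** 2) ** phi
    print(f"phi={phi}: depth={depth}, width x{width:.2f}, "
          f"resolution={res}, ~{flops_factor:.1f}x FLOPs")
```
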
Bottleneck Transformers for Visual Recognition
BoTNet is presented, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation; a simple adaptation of the BoTNet design for image classification is also presented.
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks
This work presents an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set, and reduces the computation time of self-attention from quadratic to linear in the number of elements in the set.
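
The quadratic-to-linear reduction routes attention through m learned inducing points: the inducing points first attend to the n set elements, then the elements attend back to that m-sized summary, so cost is O(nm) instead of O(n²). A single-head numpy sketch that drops the projections and residual connections of the real ISAB block:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 100, 8, 16          # n set elements, m << n inducing points

def attend(queries, keys_values):
    logits = queries @ keys_values.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values

x = rng.normal(size=(n, d))           # the input set
inducing = rng.normal(size=(m, d))    # learned inducing points

h = attend(inducing, x)   # (m, d): inducing points summarize the set, O(n*m)
out = attend(x, h)        # (n, d): elements read the summary back, O(n*m)
print(out.shape)
```
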
Big Transfer (BiT): General Visual Representation Learning
TLDR
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. Expand
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
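
The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, as defined in the paper; a single-head numpy sketch without the learned projections, with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V
print(output.shape)  # (5, 8)
```
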
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
Revisiting ResNets: Improved Training and Scaling Strategies
It is found that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
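
That pre-training task reduces to a symmetric cross-entropy over all image-text similarity pairs in a batch; the CLIP paper gives pseudocode of this shape, and a numpy sketch with stand-in encoder outputs is:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d = 4, 32
img = rng.normal(size=(batch, d))   # stand-ins for image-encoder outputs
txt = rng.normal(size=(batch, d))   # stand-ins for text-encoder outputs
temperature = 0.07                  # learned in CLIP; fixed here

# L2-normalize, then take all-pairs cosine similarities.
img = img / np.linalg.norm(img, axis=-1, keepdims=True)
txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
logits = img @ txt.T / temperature

def cross_entropy(logits, targets):
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

labels = np.arange(batch)           # matching pairs sit on the diagonal
loss = 0.5 * (cross_entropy(logits, labels)        # image -> text
            + cross_entropy(logits.T, labels))     # text -> image
print("contrastive loss:", loss)
```
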
Exploring the Limits of Weakly Supervised Pretraining
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images; it shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop top-1 accuracy to date.