Corpus ID: 239885997

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every… 
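The trade-off the abstract describes can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the 224-pixel input size is an assumption, and the quadratic term reflects only the standard property that self-attention scores every pair of tokens.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by one self-attention head (quadratic in tokens)."""
    return tokens * tokens


if __name__ == "__main__":
    # Assumed 224x224 input; smaller patches give a finer token grid.
    for patch in (56, 32, 16, 14):
        n = num_tokens(224, patch)
        print(f"patch {patch}px -> {n} tokens, {attention_pairs(n)} attention pairs")
```

Under these assumptions, 16-pixel patches give a 14x14 grid (196 tokens) and 14-pixel patches give a 16x16 grid (256 tokens), so a modest increase in token count already raises the number of attention pairs substantially.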
Coarse-to-Fine Vision Transformer
The proposed CF-ViT implements network inference in a two-stage manner, motivated by two important observations in modern ViT models: the coarse-grained patch splitting can locate informative regions of an input image and most images can be well recognized by a ViT model in a small-length token sequence.
Super Vision Transformer
A novel training paradigm is presented that trains only one ViT model at a time, yet provides improved image recognition performance at various computational costs.
Supervised Contrastive Representation Embedding Based on Transformer for Few-Shot Classification
This work employs Swin Transformer as the backbone in place of a CNN architecture to explore the potential of transformer-based backbones for few-shot learning, and, for the first time, introduces a supervised contrastive loss into meta-learning to take good advantage of extremely limited relations.
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Experiments on symbolic manipulation, compositional generalization and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin.
Glance and Focus Networks for Dynamic Visual Recognition
The proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient regions to learn finer features, mimicking the human visual system.
Vision Transformer with Deformable Attention
Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks, is presented; it achieves consistently improved results on comprehensive benchmarks.
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization, and presents an improved training scheme to address the issues introduced by the one-stage formulation.
TDAM: Top-Down Attention Module for Contextually Guided Feature Selection in CNNs
A lightweight top-down attention module (TDAM) is proposed that iteratively generates a “visual searchlight” to perform channel and spatial modulation of its inputs, and outputs more contextually relevant feature maps at each computation step. It enhances the performance of CNNs across multiple object-recognition benchmarks and outperforms prominent attention modules while being more parameter- and memory-efficient.
Sparse Fusion for Multimodal Transformers
This work presents Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while having a greatly reduced memory footprint and computation cost.


Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Kai Li and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
Multi-Scale Dense Networks for Resource Efficient Image Classification
Experiments demonstrate that the proposed framework substantially improves the existing state-of-the-art in both image classification with computational resource limits at test time and budgeted batch classification.
Transformer in Transformer
It is pointed out that the attention inside these local patches is also essential for building visual transformers with high performance, and a new architecture, namely Transformer iN Transformer (TNT), is explored.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
A new Tokens-To-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure, motivated by CNN architecture design after empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Improved Techniques for Training Adaptive Deep Networks
This paper considers a typical adaptive deep network with multiple intermediate classifiers and presents three techniques to improve its training efficacy from two aspects: a Gradient Equilibrium algorithm to resolve the conflict of learning of different classifiers; an Inline Subnetwork Collaboration approach and a One-for-all Knowledge Distillation algorithm to enhance the collaboration among classifiers.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.