Vision Transformer with Progressive Sampling

Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip H. S. Torr, Wayne Zhang, Dahua Lin. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
Transformers with powerful global relation modeling abilities have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture to image classification by simply splitting images into fixed-length tokens and employing transformers to learn relations between these tokens. However, such naive tokenization can destroy object structures, assign grids to uninteresting regions such as…

Adaptive Token Sampling For Efficient Vision Transformers

This work introduces a differentiable parameter-free Adaptive Token Sampler module, which can be plugged into any existing vision transformer architecture, and improves the SOTA by reducing computational costs (GFLOPs) by 2×, while preserving accuracy on the ImageNet, Kinetics-400, and Kinetics-600 datasets.
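The ATS idea of scoring tokens and keeping only the informative ones can be sketched as follows. This is a minimal NumPy illustration that scores patch tokens by the CLS query's attention weight and keeps the top-k; the actual paper uses inverse-transform sampling over the score distribution rather than plain top-k, and all names here are illustrative, not from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens(tokens, q, k, keep=4):
    """Score each patch token by the CLS query's attention weight over the
    keys, then retain only the `keep` highest-scoring tokens plus CLS."""
    d = q.shape[-1]
    attn = softmax(q[0] @ k.T / np.sqrt(d))  # CLS row of the attention map
    scores = attn[1:]                        # drop the CLS->CLS entry
    top = np.argsort(scores)[::-1][:keep] + 1  # shift back to token indices
    idx = np.concatenate(([0], np.sort(top)))  # CLS first, rest in order
    return tokens[idx], idx
```

Because the surviving tokens are a strict subset of the input, every transformer block after the sampler processes a shorter sequence, which is where the GFLOP savings come from.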

Vision Transformer with Deformable Attention

A novel deformable self-attention module is proposed, where the positions of key and value pairs in self-attention are selected in a data-dependent way, enabling the self-attention module to focus on relevant regions and capture more informative features.

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

This paper presents new hierarchically cascaded transformers that can improve data efficiency through attribute surrogates learning and spectral tokens pooling and shows clear advantages over SOTA few-shot classification methods in both 5-way 1-shot and 5-way 5-shot settings on four popular benchmark datasets.

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

A Pale-Shaped self-Attention (PS-Attention) mechanism is proposed, which performs self-attention within a pale-shaped region, significantly reducing computation and memory costs while capturing richer contextual information at a computational complexity similar to previous local self-attention mechanisms.

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Evo-ViT is presented, a self-motivated slow-fast token evolution approach for vision transformers that can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process.

Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes

A Sequential Transformers Attention Model (STAM) is developed that only partially observes a complete image and predicts informative glimpse locations solely based on past glimpses; it outperforms previous state-of-the-art models while observing nearly 27% and 42% fewer pixels in glimpses on ImageNet and fMoW, respectively.

Transformers Meet Visual Learning Understanding: A Comprehensive Review

This review investigates the current research progress of Transformers in image and video applications, providing a comprehensive overview of Transformers in visual learning understanding.

Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation

This paper addresses panoramic semantic segmentation, which provides a full-view and dense-pixel understanding of surroundings in a holistic way and introduces the upgraded Trans4PASS+ model, featuring DMLPv2 with parallel token mixing to improve the flexibility and generalizability in modeling discriminative cues.

SoT: Delving Deeper into Classification Head for Transformer

This paper empirically discloses that high-level word tokens contain rich information, are per se competent for classification, and are moreover complementary to the classification token; it proposes multi-headed global cross-covariance pooling with singular value power normalization, which shares a similar philosophy and is thus compatible with the transformer block, outperforming commonly used pooling methods.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
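The distillation-token objective described above can be sketched as a loss with two terms: cross-entropy of the class token against the true label, and cross-entropy of the distillation token against the teacher's prediction. The sketch below shows DeiT's hard-label variant in NumPy; the function name and the equal 1/2 weighting of the two terms are illustrative simplifications:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hard_distill_loss(cls_logits, dist_logits, label, teacher_logits):
    """Average the class-token loss (against the true label) and the
    distillation-token loss (against the teacher's hard prediction)."""
    teacher_label = int(np.argmax(teacher_logits))  # teacher's hard decision
    ce = lambda logits, y: -np.log(softmax(logits)[y])  # cross-entropy
    return 0.5 * ce(cls_logits, label) + 0.5 * ce(dist_logits, teacher_label)
```

At inference, DeiT fuses the class and distillation heads (e.g. by averaging their softmax outputs), so the extra token costs almost nothing at test time.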

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation.

Pre-Trained Image Processing Transformer

To fully exploit the capability of the transformer, the IPT model is presented, which utilizes the well-known ImageNet benchmark to generate a large number of corrupted image pairs; contrastive learning is introduced so that the model adapts well to different image processing tasks.

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

This work represents images as a set of visual tokens and applies visual transformers to densely model relationships between visual semantic concepts, finding that this paradigm of token-based image representation and processing drastically outperforms its convolutional counterparts on image classification and semantic segmentation.

Learning Texture Transformer Network for Image Super-Resolution

A novel Texture Transformer Network for Image Super-Resolution (TTSR) is proposed, in which the LR and Ref images are formulated as queries and keys in a transformer, respectively; it achieves significant improvements over state-of-the-art approaches in both quantitative and qualitative evaluations.

CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features

Patches are cut and pasted among training images, with the ground-truth labels mixed proportionally to the area of the patches; CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task.
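The patch-and-label mixing described above can be sketched in a few lines of NumPy. This is a simplified batch-level version (argument names are illustrative): a random box is pasted from a shuffled copy of the batch, and the mixing coefficient is recomputed from the exact pasted area so the label weights match the pixels:

```python
import numpy as np

def rand_bbox(h, w, lam, rng):
    # Patch covers (1 - lam) of the image area; side lengths scale with sqrt.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def cutmix(images, labels, alpha=1.0, rng=None):
    """Paste a random patch from a shuffled batch; mix one-hot labels by area."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, h, w, c = images.shape
    lam = rng.beta(alpha, alpha)          # initial mixing coefficient
    perm = rng.permutation(n)             # pairing of images within the batch
    y1, y2, x1, x2 = rand_bbox(h, w, lam, rng)
    mixed = images.copy()
    mixed[:, y1:y2, x1:x2] = images[perm, y1:y2, x1:x2]
    # Recompute lambda from the exact pasted area (clipping can shrink the box).
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels
```

Recomputing `lam` after clipping is the detail that keeps the label weights consistent with the actual number of replaced pixels near image borders.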

Attention Augmented Convolutional Networks

It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.

Generative Pretraining From Pixels

This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.

A Simple and Robust Convolutional-Attention Network for Irregular Text Recognition

This work proposes a simple yet robust approach for scene text recognition with no need to convert input images to sequence representations, and directly connects two-dimensional CNN features to an attention-based sequence decoder.