• Corpus ID: 233209859

LocalViT: Bringing Locality to Vision Transformers

Yawei Li, K. Zhang, Jie Cao, Radu Timofte, Luc Van Gool
We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings could be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet, locality is essential for images since it pertains to…


Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

This paper proposes an efficient approach to add locality to the ViT architecture, and develops a new image size curriculum learning strategy that reduces the number of patches extracted from each image at the beginning of training.

Transformers in computational visual media: A survey

This study comprehensively surveys recent visual transformer works and focuses on visual transformer methods in low-level vision and generation, which use a self-attention mechanism rather than the RNN sequential structure.

T$^{3}$SR: Texture Transfer Transformer for Remote Sensing Image Superresolution

  • Durong Cai, P. Zhang
  • Computer Science
    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • 2022
This paper proposes an end-to-end image super-resolution network, the texture transfer transformer for remote sensing image superresolution (SR), together with a U-Transformer-based feature fusion scheme that reduces the dependence on the reference image.

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

This paper explores in depth how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs, and systematically shows how convolutional embedding (CE) injects desirable inductive bias into ViTs.

KNN Local Attention for Image Restoration

This paper proposes a new attention mechanism for image restoration, called k-NN Image Transformer (KiT), that outperforms state-of-the-art restoration approaches on image denoising, deblurring and deraining benchmarks.

MaiT: Leverage Attention Masks for More Efficient Image Transformers

This work introduces attention masks that incorporate spatial locality into self-attention heads, addressing problems in model efficiency, especially for embedded applications.
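The idea of masking attention so that each patch only attends to its spatial neighbours can be sketched as follows. This is a minimal illustration, not MaiT's actual implementation: the function names, the square (Chebyshev-distance) neighbourhood, and the `radius` parameter are all assumptions made for the example.

```python
import math

def local_attention_mask(height, width, radius):
    """Boolean mask over flattened patch indices (hypothetical helper).

    Entry [i][j] is True when patches i and j lie within `radius` of each
    other on the 2D grid, measured with Chebyshev distance (an assumption).
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [max(abs(y1 - y2), abs(x1 - x2)) <= radius for (y2, x2) in coords]
        for (y1, x1) in coords
    ]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out entries."""
    kept = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# A 3x3 grid of patches where each patch sees only its immediate neighbours.
mask = local_attention_mask(3, 3, radius=1)
# Uniform scores for the top-left patch: weight is spread over its neighbours only.
weights = masked_softmax([0.0] * 9, mask[0])
```

Because every patch is always within distance zero of itself, each mask row keeps at least one entry, so the softmax denominator is never zero.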

A Survey of Visual Transformers

This survey comprehensively reviews over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, and proposes a taxonomy that organizes the representative methods by their motivations, structures, and application scenarios.

STransGAN: An Empirical Study on Transformer in GANs

This study leads to a new design of Transformers in GANs: a convolutional neural network (CNN)-free generator termed STrans-G, which achieves competitive results in both unconditional and conditional image generation.

OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification

This work proposes an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID, together with a convolution-based local relation perception module that extracts local relations and 2D position information.

MISSFormer: An Effective Medical Image Segmentation Transformer

MISSFormer is a hierarchical encoder-decoder network and has two appealing designs: a feed forward network is redesigned with the proposed Enhanced Transformer Block, which makes features aligned adaptively and enhances the long-range dependencies and local context of multi-scale features generated by the hierarchical transformer encoder.



MobileNetV2: Inverted Residuals and Linear Bottlenecks

A new mobile architecture, MobileNetV2, is described that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes and allows decoupling of the input/output domains from the expressiveness of the transformation.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

This paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain, and develops a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction.
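A rough sketch of the two ideas named above, a kernel size chosen from the channel count and a 1D convolution across neighbouring channels, might look like this. The mapping from channels to kernel size and its defaults (`gamma=2`, `b=1`) are assumptions based on how ECA is commonly described, and the function names are hypothetical; this is not the authors' code.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Pick an odd 1D kernel size that grows with log2 of the channel count.

    The formula and the defaults for gamma and b are assumptions.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1  # round up to the nearest odd number

def channel_attention(pooled, kernel):
    """Gate each channel using only its neighbouring channels.

    pooled: globally average-pooled value for each channel.
    kernel: odd-length list of 1D convolution weights shared by all channels.
    Returns one sigmoid gate per channel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(pooled) + [0.0] * pad  # zero padding at the ends
    gates = []
    for c in range(len(pooled)):
        z = sum(w * padded[c + i] for i, w in enumerate(kernel))
        gates.append(1.0 / (1.0 + math.exp(-z)))
    return gates

k_small = eca_kernel_size(64)
k_large = eca_kernel_size(256)
gates = channel_attention([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

The only learnable parameters in such a module are the kernel weights, which is why the parameter count stays at a handful regardless of the number of channels.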

Searching for MobileNetV3

This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art of MobileNets.

Transformer in Transformer

It is pointed out that the attention inside these local patches is also essential for building visual transformers with high performance, and a new architecture, namely Transformer iN Transformer (TNT), is explored.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

A new Tokens-To-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure motivated by CNN architecture design after empirical study, and reduces the parameter count and MACs of vanilla ViT by half.

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  • Ze Liu, Yutong Lin, B. Guo
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
This paper presents a hierarchical Transformer whose representation is computed with shifted windows. The design has the flexibility to model at various scales, has linear computational complexity with respect to image size, and is also reported to be beneficial for all-MLP architectures.
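The shifted-window idea can be illustrated on a small grid of token indices. This is a simplified sketch under stated assumptions, not the Swin Transformer implementation: the helper names are hypothetical, the shift is modelled as a cyclic roll of the grid, and the attention masking that a real model needs for wrapped-around tokens is omitted.

```python
def partition_windows(grid, window):
    """Split an H x W grid into non-overlapping window x window groups.

    Assumes H and W are divisible by `window`. Attention would be computed
    inside each group only, so cost grows linearly with the number of tokens.
    """
    h, w = len(grid), len(grid[0])
    return [
        [grid[y][x]
         for y in range(top, top + window)
         for x in range(left, left + window)]
        for top in range(0, h, window)
        for left in range(0, w, window)
    ]

def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y + shift) % h][(x + shift) % w] for x in range(w)]
            for y in range(h)]

# A 4x4 grid of token ids 0..15 with 2x2 windows.
grid = [[4 * y + x for x in range(4)] for y in range(4)]
regular = partition_windows(grid, 2)
# Shifting by half a window makes the next layer's windows straddle the
# boundaries of the previous layer's windows.
shifted = partition_windows(cyclic_shift(grid, 1), 2)
```

In this example the first regular window groups tokens 0, 1, 4, 5, while the first shifted window groups tokens 5, 6, 9, 10, which come from four different regular windows. Alternating the two layouts is what lets information flow across window boundaries.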