Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows… 
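The shifted-window idea summarized above can be sketched in a few lines: partition the feature map into non-overlapping windows, compute self-attention within each, and cyclically shift the map before the next layer's partition so windows straddle the previous boundaries. The following is a minimal NumPy illustration under those assumptions, not the paper's implementation; `window_partition` and `shift_windows` are hypothetical helper names:

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size, window_size, C);
    self-attention would then be computed within each window independently.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shift_windows(x, shift):
    """Cyclically shift the feature map so that the next layer's windows
    bridge the boundaries of the previous layer's windows."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

feat = np.arange(8 * 8 * 1, dtype=np.float32).reshape(8, 8, 1)
windows = window_partition(feat, window_size=4)            # 4 windows of 4x4
shifted = window_partition(shift_windows(feat, 2), 4)      # shifted partition
```

Alternating between the regular and shifted partitions is what gives the architecture cross-window connections at linear cost in image size.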

SimViT: Exploring a Simple Vision Transformer with Sliding Windows

This paper introduces a simple vision Transformer named SimViT that incorporates spatial structure and local information into vision Transformers, replacing conventional Multi-head Self-Attention with Multi-head Central Self-Attention (MCSA) to capture highly local relations.

CBPT: A New Backbone for Enhancing Information Transmission of Vision Transformers

The Locally-Enhanced Window Self-Attention mechanism is developed to double the receptive field while keeping computational complexity similar to typical window self-attention (WSA), along with Information-Enhanced Patch Merging, which addresses the loss of information when downsampling the attention map.

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

The Cross-Shaped Window self-attention mechanism computes self-attention in horizontal and vertical stripes in parallel, which together form a cross-shaped window; each stripe is obtained by splitting the input feature into stripes of equal width.
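The stripe splitting described above can be illustrated with a small sketch (an assumed NumPy helper, not CSWin's actual code): half of the attention heads would attend within horizontal stripes and half within vertical ones, so their union forms a cross-shaped window.

```python
import numpy as np

def stripe_partition(x, stripe_width, horizontal=True):
    """Split an (H, W, C) feature map into stripes of equal width.

    Horizontal stripes have shape (stripe_width, W, C); vertical stripes
    have shape (H, stripe_width, C). Self-attention would be computed
    within each stripe independently.
    """
    H, W, C = x.shape
    if horizontal:
        return x.reshape(H // stripe_width, stripe_width, W, C)
    # For vertical stripes, swap the spatial axes, split, then swap back.
    x = x.transpose(1, 0, 2).reshape(W // stripe_width, stripe_width, H, C)
    return x.transpose(0, 2, 1, 3)

feat = np.zeros((8, 8, 4))
h_stripes = stripe_partition(feat, 2, horizontal=True)   # 4 stripes of (2, 8, 4)
v_stripes = stripe_partition(feat, 2, horizontal=False)  # 4 stripes of (8, 2, 4)
```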

ConMW Transformer: A General Vision Transformer Backbone With Merged-Window Attention

A new backbone network combining window-based attention and convolutional neural networks named ConMW Transformer is proposed, introducing convolution into the Transformer to help it converge quickly and improve accuracy.

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

This work degenerates the Swin Transformer to a plain Window-based (Win) Transformer by discarding sophisticated shifted window partitioning and discovers that a simple depthwise convolution is sufficient for achieving effective cross-window communications.

Vision Transformer with Convolutions Architecture Search

The proposed topology, based on the multi-head attention mechanism and CNNs, adaptively associates relational features of pixels with multi-scale features of objects, enhancing the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.

ResT: An Efficient Transformer for Visual Recognition

Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

A new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) is offered that enjoys both high efficiency and good performance in masked image modeling (MIM); it removes the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask units can be serialized as in plain vision transformers.

Vision Transformer with Quadrangle Attention

This work proposes a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation, and integrates QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost.

Vicinity Vision Transformer

This work presents a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity, and proposes a new Vicinity Vision Transformer (VVT) structure to reduce the feature dimension without degrading accuracy.

Toward Transformer-Based Object Detection

This paper determines that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results, and views ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

Transformer in Transformer

It is pointed out that the attention inside these local patches is also essential for building visual transformers with high performance, and a new architecture, namely Transformer iN Transformer (TNT), is explored.

Bottleneck Transformers for Visual Recognition

BoTNet is presented, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation; a simple adaptation of the BoTNet design for image classification is also presented.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

A new Tokens-To-Token Vision Transformer (T2T-ViT) is presented, which incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study; it reduces the parameter count and MACs of vanilla ViT by half.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

An attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), is presented to bridge other representations into a typical object detector built on a single representation format in an end-to-end fashion.

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

This paper deploys a pure transformer to encode an image as a sequence of patches, termed SEgmentation TRansformer (SETR), and shows that SETR achieves new state-of-the-art results on ADE20K and Pascal Context, and competitive results on Cityscapes.

Do We Really Need Explicit Position Encodings for Vision Transformers?

This paper proposes to employ an implicit conditional position encoding scheme, which is conditioned on the local neighborhood of the input token and is effortlessly implemented as what the authors call a Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework.