Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

@article{Liu2021SwinTH,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng Zhang and Stephen Lin and Baining Guo},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={9992-10002}
}
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows… 
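To make the shifted-window idea concrete, below is a minimal sketch of window partitioning with a cyclic shift, assuming a PyTorch (B, H, W, C) tensor layout; the function names window_partition and shifted_windows, and the window/shift sizes in the example, are illustrative assumptions rather than the authors' implementation.

import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # Split a (B, H, W, C) feature map into non-overlapping window_size x window_size
    # windows; returns (num_windows * B, window_size * window_size, C) token groups,
    # on which per-window multi-head self-attention can then be applied.
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shifted_windows(x: torch.Tensor, window_size: int, shift_size: int) -> torch.Tensor:
    # Cyclically shift the feature map before partitioning so that the next layer's
    # windows straddle the previous layer's window boundaries, creating cross-window
    # connections. (The full method also masks attention between wrapped-around
    # tokens after the roll; that detail is omitted in this sketch.)
    if shift_size > 0:
        x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(x, window_size)

# Example: a 56x56, 96-channel map split into 7x7 windows; a following block would shift by 3.
feat = torch.randn(1, 56, 56, 96)
regular = shifted_windows(feat, window_size=7, shift_size=0)  # shape (64, 49, 96)
shifted = shifted_windows(feat, window_size=7, shift_size=3)  # shape (64, 49, 96)

Alternating between the regular and shifted partitioning in consecutive layers is what lets information propagate across window boundaries while keeping attention cost linear in image size.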
SimViT: Exploring a Simple Vision Transformer with sliding windows
TLDR
This paper introduces a simple vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers, and introduces Multi-head Central Self-Attention (MCSA) in place of conventional multi-head self-attention to capture highly local relations.
What Makes for Hierarchical Vision Transformer?
TLDR
The study reveals that the macro architecture of the Swin model family, rather than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to the CNN's dense sliding-window paradigm.
Vision Transformer with Convolutions Architecture Search
TLDR
The proposed topology, based on the multi-head attention mechanism and CNNs, adaptively associates relational features of pixels with multi-scale features of objects, enhancing the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
ResT: An Efficient Transformer for Visual Recognition
TLDR
Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.
Vicinity Vision Transformer
TLDR
This work presents a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity and proposes a new Vicinity Vision Transformer (VVT) structure to reduce the feature dimension without degrading accuracy.
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
TLDR
A Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE, which has an intrinsic locality IB and is able to learn local features and global dependencies collaboratively; the model is scaled up to 644M parameters and obtains state-of-the-art classification performance.
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
TLDR
A new vision transformer, named Shuffle Transformer, is proposed, which is highly efficient and easy to implement by modifying two lines of code; a depth-wise convolution is introduced to complement the spatial shuffle for enhancing neighbor-window connections.
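The spatial shuffle mentioned in this entry can be pictured as transposing the window-grid axes and the within-window axes of the token grid; the following is a hedged sketch under that assumption, using a PyTorch (B, H, W, C) layout, and is not the paper's exact implementation.

import torch

def spatial_shuffle(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # Swap the window-grid axes and the within-window axes of a (B, H, W, C) map,
    # so tokens that previously sat in different windows end up grouped together
    # before the next window-attention layer. Illustrative rearrangement only.
    B, H, W, C = x.shape
    g_h, g_w = H // window_size, W // window_size
    x = x.view(B, g_h, window_size, g_w, window_size, C)
    return x.permute(0, 2, 1, 4, 3, 5).reshape(B, H, W, C)

Running ordinary window attention after such a rearrangement lets tokens from formerly separate windows interact, which is the kind of neighbor-window connection the TLDR refers to.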
CAT: Cross Attention in Vision Transformer
TLDR
A new attention mechanism in Transformer, termed Cross Attention, is proposed, which alternates attention within the image patch instead of the whole image to capture local information and applies attention between image patches, which are divided from single-channel feature maps, to capture global information.
SepViT: Separable Vision Transformer
TLDR
A novel window token embedding and grouped self-attention are employed to model the attention relationship among windows with negligible computational cost and to capture long-range visual dependencies across multiple windows, respectively.
Vision Transformer Architecture Search
TLDR
This paper designs a new effective yet efficient weight sharing paradigm for ViTs, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer.
...

References

Showing 1–10 of 88 references
Toward Transformer-Based Object Detection
TLDR
This paper determines that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results, and views ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
TLDR
The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.
Bottleneck Transformers for Visual Recognition
TLDR
BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation, is presented, along with a simple adaptation of the BoTNet design for image classification.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TLDR
A new Tokens-to-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
Transformer in Transformer
TLDR
It is pointed out that the attention inside these local patches is also essential for building visual transformers with high performance, and a new architecture, namely Transformer iN Transformer (TNT), is explored.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder
TLDR
An attention-based decoder module, similar to that in the Transformer, is presented to bridge other representations into a typical object detector built on a single representation format in an end-to-end fashion.
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
TLDR
This paper deploys a pure transformer to encode an image as a sequence of patches, termed SEgmentation TRansformer (SETR), and shows that SETR achieves a new state of the art on ADE20K and Pascal Context, and competitive results on Cityscapes.
ResNeSt: Split-Attention Networks
TLDR
A simple and modular Split-Attention block that enables attention across feature-map groups is presented; stacking these blocks ResNet-style preserves the overall ResNet structure so that it can be used in downstream tasks straightforwardly without introducing additional computational costs.
...