A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Yuzhong Chen, Yu Du, Zhe Xiao, Lin Zhao, Lu Zhang, David Weizhong Liu, Dajiang Zhu, Tuo Zhang, Xintao Hu, Tianming Liu, Xi Jiang
Vision transformer (ViT) and its variants have achieved remarkable success in various visual tasks. The key characteristic of these ViT models is that they adopt different strategies for aggregating spatial patch information within the artificial neural networks (ANNs). However, there is still a lack of a unified representation of different ViT architectures for systematic understanding and assessment of model representation performance. Moreover, how those well-performing ViT ANNs are similar to…

Graph Structure of Neural Networks

A novel graph-based representation of neural networks, called a relational graph, is developed in which layers of neural network computation correspond to rounds of message exchange along the graph structure. The analysis shows that a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance.
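The correspondence between a network layer and a round of message exchange can be illustrated with a minimal sketch (a hypothetical toy implementation, not the paper's code): each node holds a feature vector, transforms it, and aggregates the transformed features of its neighbors through the graph's adjacency matrix.

```python
import numpy as np

def message_exchange_round(features, adjacency, weights):
    """One round of message exchange on a relational graph.

    In the relational-graph view, one such round corresponds to one
    layer of neural network computation: each node transforms its
    feature vector and sums the messages arriving along its edges.
    """
    messages = features @ weights      # per-node feature transformation
    aggregated = adjacency @ messages  # each node sums its neighbors' messages
    return np.tanh(aggregated)         # nonlinearity, as in an ANN layer

# Tiny example: 4 nodes on a ring graph, 3-dimensional node features.
rng = np.random.default_rng(0)
adjacency = np.array([[0, 1, 0, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 0, 1, 0]], dtype=float)
features = rng.standard_normal((4, 3))
weights = rng.standard_normal((3, 3))
out = message_exchange_round(features, adjacency, weights)
print(out.shape)  # one feature vector per node after the round
```

Stacking several such rounds, with the same graph but fresh weights per round, mimics a multi-layer network whose connectivity pattern is fixed by the relational graph.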

MetaFormer is Actually What You Need for Vision

It is argued that MetaFormer, a general architecture abstracted from transformers that leaves the token mixer unspecified, is the key to the superior results of recent transformer and MLP-like models on vision tasks.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  • Ze Liu, Yutong Lin, B. Guo
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
A hierarchical Transformer whose representation is computed with shifted windows, offering the flexibility to model at various scales with linear computational complexity with respect to image size; the approach also proves beneficial for all-MLP architectures.

A Visual Vocabulary for Flower Classification

  • M. Nilsback, Andrew Zisserman
  • Computer Science
    2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)
  • 2006
It is demonstrated that developing a visual vocabulary which explicitly represents the various aspects distinguishing one flower from another can overcome the ambiguities that exist between flower categories.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither is necessary: MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
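The core idea, alternating an MLP applied across patches (token mixing) with an MLP applied across features (channel mixing), can be sketched as follows; this is a simplified toy version with residual connections and layer normalization omitted, not the authors' implementation:

```python
import numpy as np

def mlp(x, w1, w2):
    # Two-layer perceptron; tanh stands in for the paper's GELU.
    return np.tanh(x @ w1) @ w2

def mixer_block(tokens, w_tok1, w_tok2, w_ch1, w_ch2):
    """One simplified Mixer block on a (patches, channels) matrix.

    Token mixing transposes so the MLP acts across the patch axis;
    channel mixing then acts across the feature axis of each patch.
    """
    mixed = mlp(tokens.T, w_tok1, w_tok2).T  # mix information between patches
    return mlp(mixed, w_ch1, w_ch2)          # mix channels within each patch

# Toy shapes: 16 patches, 8 channels, hidden widths 32 and 16.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
out = mixer_block(tokens,
                  rng.standard_normal((16, 32)), rng.standard_normal((32, 16)),
                  rng.standard_normal((8, 16)), rng.standard_normal((16, 8)))
print(out.shape)  # same (patches, channels) shape as the input
```

Because both mixing steps are plain matrix products, no convolution or attention operation appears anywhere in the block.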

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
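The "16x16 words" of the title refers to splitting the image into fixed-size patches that are flattened into token vectors before entering the transformer. A minimal sketch of that patchification step (a toy illustration, not the reference implementation; the learned linear projection and class token are omitted):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the input "words" of a Vision Transformer."""
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    patches = (image[:ph * patch_size, :pw * patch_size]
               .reshape(ph, patch_size, pw, patch_size, c)
               .swapaxes(1, 2)                              # group by patch grid
               .reshape(ph * pw, patch_size * patch_size * c))
    return patches

# A standard 224x224 RGB image yields a 14x14 grid of patches,
# each flattened to 16 * 16 * 3 = 768 values.
image = np.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

In the full model each of these 196 vectors is linearly projected, given a position embedding, and processed by a stack of transformer encoder blocks.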

Super-convergence: very fast training of neural networks using large learning rates

A phenomenon called super-convergence is described in which neural networks can be trained an order of magnitude faster than with standard training methods, and it is shown that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited.

ImageNet Large Scale Visual Recognition Challenge

The creation of this benchmark dataset and the advances in object recognition it has enabled are described, and state-of-the-art computer vision accuracy is compared with human accuracy.

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

This work proposes a novel Unified transFormer (UniFormer) that seamlessly integrates the merits of convolution and self-attention in a concise transformer format, and applies it to a range of vision tasks, from image to video and from classification to dense prediction.