A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Yuzhong Chen, Yu Du, Zhe Xiao, Lin Zhao, Lu Zhang, David Weizhong Liu, Dajiang Zhu, Tuo Zhang, Xintao Hu, Tianming Liu, Xi Jiang
Vision transformer (ViT) and its variants have achieved remarkable success in various visual tasks. The key characteristic of these ViT models is that they adopt different strategies for aggregating spatial patch information within artificial neural networks (ANNs). However, a unified representation of different ViT architectures is still lacking, which hinders systematic understanding and assessment of model representation performance. Moreover, how those well-performing ViT ANNs are similar to…

Do Vision Transformers See Like Convolutional Neural Networks?

An analysis of the internal representation structure of ViTs and CNNs on image classification benchmarks reveals striking differences between the two architectures: ViTs have more uniform representations across all layers, and their residual connections strongly propagate features from lower to higher layers.

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

This work proposes a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format, and adopts it for various vision tasks from image to video domain, from classification to dense prediction.

Shunted Self-Attention via Multi-Scale Token Aggregation

A novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model attentions at hybrid scales per attention layer and outperforms the state-of-the-art Focal Transformer on ImageNet with only half the model size and computation cost.

Graph Structure of Neural Networks

A novel graph-based representation of neural networks called relational graph is developed, where layers of neural network computation correspond to rounds of message exchange along the graph structure, which shows that a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance.
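The core idea can be sketched in a few lines: one network layer corresponds to one round of message exchange, in which every node aggregates transformed features from its graph neighbours. This is an illustrative sketch, not the paper's implementation; the graph, feature sizes, and mean-aggregation rule below are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

n_nodes, dim = 4, 8
# Undirected relational graph (hypothetical example adjacency)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
x = rng.standard_normal((n_nodes, dim))    # per-node feature vectors
W = rng.standard_normal((dim, dim)) * 0.1  # shared message transform

def message_round(x, adj, W):
    """One message-exchange round: neighbour-mean of transformed features, plus ReLU."""
    deg = adj.sum(axis=1, keepdims=True)   # each node's neighbour count
    return np.maximum(adj @ (x @ W) / deg, 0.0)

h = message_round(x, adj, W)
print(h.shape)  # each round preserves the (nodes, dim) shape
```

Stacking several such rounds mirrors stacking network layers, which is what lets graph measures (e.g. clustering, path length) be related to predictive performance.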

MetaFormer is Actually What You Need for Vision

It is argued that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks, and calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules.

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

This paper proposes a dual-branch transformer that combines image patches of different sizes to produce stronger image features, and develops a simple yet effective token fusion module based on cross-attention which uses a single token for each branch as a query to exchange information with other branches.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
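The attention operation at the heart of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal single-head numpy sketch (batching and learned projections omitted):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head, no batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise query-key similarities
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights            # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 16))  # 5 query tokens, d_k = 16
K = rng.standard_normal((5, 16))
V = rng.standard_normal((5, 16))
out, w = attention(Q, K, V)
print(out.shape)  # (5, 16); each row of w sums to 1
```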

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
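MLP-Mixer alternates two operations: token mixing (an MLP applied across the patch dimension) and channel mixing (an MLP applied across the feature dimension). The sketch below is a simplification for illustration, not the paper's implementation: the sizes are hypothetical and each "MLP" is reduced to a single linear map with a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)

patches, channels = 9, 12                # e.g. a 3x3 grid of patch embeddings
x = rng.standard_normal((patches, channels))

W_token = rng.standard_normal((patches, patches)) * 0.1   # mixes across patches
W_chan = rng.standard_normal((channels, channels)) * 0.1  # mixes across features

def mixer_block(x):
    x = x + W_token @ x   # token mixing: exchange information between patches
    x = x + x @ W_chan    # channel mixing: mix features within each patch
    return x

y = mixer_block(x)
print(y.shape)  # shape preserved: (9, 12)
```

Stacking such blocks gives a model with no convolutions and no attention, which is the paper's point.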