Sequencer: Deep LSTM for Image Classification

Yuki Tatsunami and Masato Taki

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance… 
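Sequencer's core idea is to replace self-attention with LSTMs run over the token map. A minimal NumPy sketch of bidirectional LSTM token mixing along the vertical and horizontal axes follows; all function names, weight layouts, and the final fusion projection are illustrative assumptions, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(x, Wx, Wh, b):
    """Plain LSTM over a sequence x of shape (seq_len, d_in);
    returns hidden states of shape (seq_len, d_hid)."""
    d_hid = Wh.shape[0]
    h = np.zeros(d_hid)
    c = np.zeros(d_hid)
    out = []
    for t in range(x.shape[0]):
        gates = x[t] @ Wx + h @ Wh + b          # (4 * d_hid,)
        i, f, g, o = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        out.append(h)
    return np.stack(out)

def bilstm(x, params):
    """Bidirectional LSTM: concatenate forward and backward passes."""
    fwd = lstm(x, *params)
    bwd = lstm(x[::-1], *params)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)  # (seq_len, 2 * d_hid)

def bilstm2d_mix(tokens, params_v, params_h, W_proj):
    """Mix a (H, W, C) token map with BiLSTMs run along columns (vertical)
    and rows (horizontal), then fuse with a linear projection back to C."""
    H, W, C = tokens.shape
    vert = np.stack([bilstm(tokens[:, w], params_v) for w in range(W)], axis=1)
    horz = np.stack([bilstm(tokens[r], params_h) for r in range(H)], axis=0)
    return np.concatenate([vert, horz], axis=-1) @ W_proj
```

This is only the token-mixing step; the full block would wrap it with normalization, a channel MLP, and residual connections as in the paper.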

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

A new Tokens-To-Token Vision Transformer (T2T-ViT) incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design and empirical study, and reduces the parameter count and MACs of vanilla ViT by half.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and that MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
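The alternation described above — an MLP applied across tokens, then an MLP applied across channels — can be sketched in a few lines of NumPy; function names and shapes are illustrative assumptions, and the LayerNorm here omits the learned scale/shift:

```python
import numpy as np

def layernorm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, W2):
    return gelu(x @ W1) @ W2

def mixer_block(X, tok_W1, tok_W2, ch_W1, ch_W2):
    """One Mixer block on X of shape (num_tokens, channels): a token-mixing
    MLP acting across tokens, then a channel-mixing MLP acting across
    channels, each with a residual connection."""
    X = X + mlp(layernorm(X).T, tok_W1, tok_W2).T  # mix across the token axis
    X = X + mlp(layernorm(X), ch_W1, ch_W2)        # mix across the channel axis
    return X
```

The transpose before the first MLP is the whole trick: it lets an ordinary dense layer mix information between spatial positions instead of between channels.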

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

An attention-free network called sMLPNet is built upon existing MLP-based vision models, replacing the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module that significantly reduces the number of model parameters and the computational complexity.
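The sparsity comes from mixing tokens only along each row and each column rather than across all positions at once. A minimal NumPy sketch of such axial mixing, assuming a three-branch (identity, height, width) fusion — names and shapes are my own illustration, not the paper's code:

```python
import numpy as np

def smlp_mix(X, W_h, W_w, W_fuse):
    """Sparse token mixing on X of shape (H, W, C): one fully-connected layer
    along the height axis, one along the width axis, plus the identity branch,
    fused by a 3C -> C linear projection."""
    mix_h = np.einsum('hwc,hk->kwc', X, W_h)  # mix tokens within each column
    mix_w = np.einsum('hwc,wk->hkc', X, W_w)  # mix tokens within each row
    fused = np.concatenate([X, mix_h, mix_w], axis=-1)  # (H, W, 3C)
    return fused @ W_fuse                     # back to (H, W, C)
```

Each token interacts with only H + W - 1 others instead of all H * W, which is where the parameter and FLOP savings over a dense token-mixing MLP come from.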

RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

The small model, RaftMLP-S, is comparable to state-of-the-art global MLP-based models in parameter count and computational efficiency, and the problem of fixed input image resolution in global MLP-based models is tackled by utilizing bicubic interpolation.

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

This paper proposes a dual-branch transformer to combine image patches of different sizes to produce stronger image features and develops a simple yet effective token fusion module based on cross-attention, which uses a single token for each branch as a query to exchange information with other branches.
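The fusion mechanism described above — one branch's CLS token acting as the sole query over the other branch's patch tokens — can be sketched as single-head attention in NumPy; function names and weight shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_fuse(cls_a, patches_b, Wq, Wk, Wv):
    """Branch A's CLS token (shape (C,)) is the only query; it attends over
    branch B's patch tokens (shape (n, C)) and is updated residually."""
    q = cls_a @ Wq                                  # (C,)
    K = patches_b @ Wk                              # (n, C)
    V = patches_b @ Wv                              # (n, C)
    attn = softmax(q @ K.T / np.sqrt(K.shape[-1]))  # (n,) attention weights
    return cls_a + attn @ V                         # updated CLS token
```

Because only the single CLS query attends across branches, the fusion cost is linear in the number of patch tokens rather than quadratic.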

Refiner: Refining Self-attention for Vision Transformers

This work introduces a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs, and explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity.
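The expansion step can be sketched as a pair of linear maps over the head axis of the attention maps — a minimal NumPy sketch in which the refinement applied in the expanded space (e.g. a local convolution) is omitted, and names and shapes are assumptions:

```python
import numpy as np

def expand_refine_reduce(attn, W_expand, W_reduce):
    """attn: (heads, n, n) attention maps. Linearly mix the per-head maps into
    a larger set of e > heads maps, where a refinement step would act (omitted
    here), then linearly reduce back to the original number of heads."""
    expanded = np.einsum('hnm,he->enm', attn, W_expand)  # (e, n, n)
    return np.einsum('enm,eh->hnm', expanded, W_reduce)  # (heads, n, n)
```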

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
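At the loss level, the distillation token gets its own classification head trained against the teacher's prediction. A minimal NumPy sketch of the hard-label variant of this objective (function names are my own; the paper also describes a soft, KL-based variant):

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def deit_hard_distill_loss(cls_logits, dist_logits, y_true, teacher_logits):
    """Hard-label distillation: the class token's head is trained on the
    ground-truth label, the distillation token's head on the teacher's argmax
    prediction; the two cross-entropy terms are averaged."""
    y_teacher = int(np.argmax(teacher_logits))
    ce_cls = -log_softmax(cls_logits)[y_true]
    ce_dist = -log_softmax(dist_logits)[y_teacher]
    return 0.5 * (ce_cls + ce_dist)
```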

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

A novel MorphMLP architecture is proposed that focuses on capturing local details in the low-level layers while gradually shifting to long-term modeling in the high-level layers; it can be as powerful as, and even outperform, self-attention-based models.

Going deeper with convolutions

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).