Corpus ID: 236965572

RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

Yuki Tatsunami and M. Taki
For the past ten years, CNNs have reigned supreme in the world of computer vision, but recently, Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice. In this context, there has been much research on architectures without CNNs and self-attention. In particular, MLP-Mixer is a simple design built from MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the…


BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks
A locally connected spiking neural network (SNN) trained using spike-timing-dependent plasticity (STDP) and its reward-modulated variant (R-STDP) is designed; the proposed architecture is named BioLCNet.
MOI-Mixer: Improving MLP-Mixer with Multi Order Interactions in Sequential Recommendation
The Multi-Order Interaction (MOI) layer is proposed, which is capable of expressing an arbitrary order of interactions within the inputs while maintaining the memory and time complexity of the MLP layer.


S2-MLP: Spatial-Shift MLP Architecture for Vision
A novel pure-MLP architecture, spatial-shift MLP (S2-MLP), which achieves performance as good as ViT's on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.
Rethinking the Inception Architecture for Computer Vision
This work explores ways to scale up networks that aim to use the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
Improved Regularization of Convolutional Neural Networks with Cutout
This paper shows that the simple regularization technique of randomly masking out square regions of input during training, which is called cutout, can be used to improve the robustness and overall performance of convolutional neural networks.
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
To learn multi-scale feature representations in transformer models for image classification, this paper proposes a dual-branch transformer that combines image patches of different sizes to produce stronger image features, and develops a simple yet effective token fusion module based on cross attention.
DeepViT: Towards Deeper Vision Transformer
This paper proposes a simple yet effective method, named Re-attention, that re-generates the attention maps to increase their diversity at different layers with negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition…
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
On the Relationship between Self-Attention and Convolutional Layers
This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
Attention Augmented Convolutional Networks
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
XCiT: Cross-Covariance Image Transformers
This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries; it has linear complexity in the number of tokens and allows efficient processing of high-resolution images.