Corpus ID: 233714958

MLP-Mixer: An all-MLP Architecture for Vision

@inproceedings{tolstikhin2021mlpmixer,
  title={MLP-Mixer: An all-MLP Architecture for Vision},
  author={Ilya O. Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e…
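The two layer types the abstract describes can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: layer normalization and the skip-path details of the real model are omitted, the GELU is a tanh approximation, and all parameter names are assumptions.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron; tanh-approximated GELU between the layers.
    h = x @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

def mixer_block(x, params):
    """One Mixer block on x of shape (patches, channels); LayerNorm omitted.

    params is a dict with keys "token" and "channel", each a (w1, b1, w2, b2) tuple.
    """
    # Token-mixing MLP: acts across patches, with weights shared over channels.
    y = x + mlp(x.T, *params["token"]).T
    # Channel-mixing MLP: acts across channels, with weights shared over patches.
    return y + mlp(y, *params["channel"])
```

Transposing before the token-mixing MLP is what lets a plain per-row MLP mix spatial information instead of channel information.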


S2-MLP: Spatial-Shift MLP Architecture for Vision

This paper proposes a novel pure-MLP architecture, spatial-shift MLP (S2-MLP), which achieves performance on par with ViT on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.
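The core spatial-shift operation can be sketched like this: split the channels into four groups and shift each group one position in a different spatial direction, so that a subsequent channel-mixing MLP sees neighboring-patch features. A hedged sketch, not the authors' code; padding behavior at the borders (here: copy the original values) is an assumption.

```python
import numpy as np

def spatial_shift(x):
    """S2-MLP-style spatial shift on features x of shape (H, W, C)."""
    out = x.copy()
    c = x.shape[2] // 4
    out[1:, :, :c]       = x[:-1, :, :c]        # group 1: shift down
    out[:-1, :, c:2*c]   = x[1:, :, c:2*c]      # group 2: shift up
    out[:, 1:, 2*c:3*c]  = x[:, :-1, 2*c:3*c]   # group 3: shift right
    out[:, :-1, 3*c:4*c] = x[:, 1:, 3*c:4*c]    # group 4: shift left
    return out
```

Because the shift itself has no parameters, all learnable capacity stays in the channel MLPs, which is where the FLOP and parameter savings come from.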

Rethinking Token-Mixing MLP for MLP-based Vision Backbone

A Circulant Channel-Specific (CCS) token-mixing MLP is proposed that is spatial-invariant and channel-specific; it takes fewer parameters yet achieves higher classification accuracy on the ImageNet-1K benchmark.

MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation

This work proposes a new unpaired image-to-image translation model called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms and achieves competitive results when compared to prior convolutional-based methods.

Global Filter Networks for Image Classification

The Global Filter Network is presented, a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity and can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
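The frequency-domain token mixing described above can be sketched in a few lines: take a 2D FFT over the spatial dimensions, multiply element-wise by a learnable complex filter, and transform back. The FFT is what gives the log-linear complexity. This is a hedged sketch under assumed shapes, not the GFNet implementation.

```python
import numpy as np

def global_filter(x, filt):
    """Global-filter-style layer: mix spatial tokens in the frequency domain.

    x: real features of shape (H, W, C).
    filt: learnable complex filter of shape (H, W//2 + 1, C), matching rfft2 output.
    """
    X = np.fft.rfft2(x, axes=(0, 1))                       # 2D real FFT over space
    return np.fft.irfft2(X * filt, s=x.shape[:2], axes=(0, 1))
```

With an all-ones filter the layer is the identity, which makes a convenient sanity check; a trained filter instead reweights spatial frequencies globally, so every token can influence every other one in a single layer.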

Container: Context Aggregation Network

The CONTAINER (CONText AggregatIon NEtwoRk) is presented: a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while retaining the inductive bias of the local convolution operation, leading to faster convergence.

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

A systematic empirical study finds that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data.

Can attention enable MLPs to catch up with CNNs?

A brief history of learning architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers, and the views on challenges and directions for new learning architectures are given, hoping to inspire future research.

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

By promoting smoothness with a recently proposed sharpness-aware optimizer, this paper substantially improves the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning.

Towards Biologically Plausible Convolutional Networks

This work proposes adding lateral connectivity to a locally connected network and allowing learning via Hebbian plasticity, which enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.

ResMLP: Feedforward networks for image classification with data-efficient training

ResMLP is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch; it attains surprisingly good accuracy/complexity trade-offs on ImageNet.
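The alternation the summary describes can be sketched as one residual block: a single cross-patch linear map (not a full MLP, unlike Mixer's token-mixing block) followed by a per-patch two-layer feed-forward network. A hedged sketch with assumed parameter names; ResMLP's affine normalization layers are omitted for brevity.

```python
import numpy as np

def gelu(x):
    # tanh-approximated GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def resmlp_block(x, A, w1, w2):
    """One ResMLP-style block on x of shape (patches, channels).

    A: (patches, patches) cross-patch linear map, shared across channels.
    w1, w2: per-patch feed-forward weights over channels.
    """
    x = x + A @ x                  # (i) patches interact via a single linear layer
    return x + gelu(x @ w1) @ w2   # (ii) channels interact independently per patch
```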

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

This paper proposes a novel, simple yet effective activation scheme called Concatenated ReLU (CReLU), theoretically analyzes its reconstruction property in CNNs, and integrates CReLU into several state-of-the-art CNN architectures, demonstrating improved recognition performance on the CIFAR-10/100 and ImageNet datasets with fewer trainable parameters.
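The activation itself is a one-liner: concatenate the ReLU of the input with the ReLU of its negation, so both the positive and negative phases of a feature survive (at the cost of doubling the channel count). A minimal sketch:

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenated ReLU: preserves both phases of the pre-activation.

    Output has twice the size of x along `axis`.
    """
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=axis)
```

The reconstruction property the paper analyzes follows from the fact that x is recoverable from crelu(x): the positive half minus the negative half gives back the input.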

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Understanding Robustness of Transformers for Image Classification

It is found that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations, and Transformers are robust to the removal of almost any single layer.

Squeeze-and-Excitation Networks

This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets.
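The recalibration described above has three steps: squeeze spatial information into per-channel statistics, excite them through a small bottleneck MLP with a sigmoid gate, and rescale the feature map channel-wise. A hedged sketch under assumed shapes, not the SENet implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on features x of shape (H, W, C).

    w1: (C, C_reduced) and w2: (C_reduced, C) form the bottleneck MLP.
    """
    s = x.mean(axis=(0, 1))                   # squeeze: global average pool -> (C,)
    e = sigmoid(np.maximum(s @ w1, 0) @ w2)   # excite: ReLU bottleneck, sigmoid gate
    return x * e                              # rescale each channel by its gate
```

Because the gates lie in (0, 1), the block can only attenuate channels, never amplify them; its power comes from making that attenuation input-dependent.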

On the Relationship between Self-Attention and Convolutional Layers

This work proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Towards Learning Convolutions from Scratch

This work proposes $\beta$-LASSO, a simple variant of the LASSO algorithm that, when applied to fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets.

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.