• Corpus ID: 233714958

MLP-Mixer: An all-MLP Architecture for Vision

@inproceedings{Tolstikhin2021MLPMixerAA,
title={MLP-Mixer: An all-MLP Architecture for Vision},
author={Ilya O. Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
booktitle={Neural Information Processing Systems},
year={2021}
}
• Published in
Neural Information Processing…
4 May 2021
• Computer Science
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e…
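The two layer types the abstract describes can be sketched in a few lines of NumPy. This is our own simplification, not code from the paper: function names and shapes are assumptions, ReLU stands in for the paper's GELU, and layer normalization is omitted.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron; ReLU stands in for GELU for brevity.
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

def mixer_layer(X, tok_params, ch_params):
    # One Mixer layer on a patch table X of shape (patches, channels).
    # The token-mixing MLP acts on columns (mixes information across patches);
    # the channel-mixing MLP acts on rows (mixes across channels).
    X = X + mlp(X.T, *tok_params).T   # token mixing + skip connection
    X = X + mlp(X, *ch_params)        # channel mixing + skip connection
    return X

rng = np.random.default_rng(0)
patches, channels, hidden = 4, 8, 16
X = rng.normal(size=(patches, channels))
tok = (rng.normal(size=(patches, hidden)), np.zeros(hidden),
       rng.normal(size=(hidden, patches)), np.zeros(patches))
ch = (rng.normal(size=(channels, hidden)), np.zeros(hidden),
      rng.normal(size=(hidden, channels)), np.zeros(channels))
assert mixer_layer(X, tok, ch).shape == (patches, channels)
```

The key design point: the same two MLPs are shared across all patches (channel mixing) and across all channels (token mixing), so no convolution or attention is needed to exchange spatial information.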
750 Citations


• Computer Science
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
• 2022
This paper proposes a novel pure-MLP architecture, spatial-shift MLP (S2-MLP), which matches the performance of ViT on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.
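The spatial-shift idea can be illustrated as follows; this is a sketch of ours under assumed shapes, with naive border handling (edge pixels keep their original values), not the paper's implementation:

```python
import numpy as np

def spatial_shift(x):
    # Split the channels of x (H, W, C) into four groups and shift each group
    # one pixel in a different direction, so that a subsequent per-pixel
    # (channel-mixing) MLP sees features from neighbouring pixels.
    h, w, c = x.shape
    g = c // 4
    out = x.copy()
    out[1:, :, :g]       = x[:-1, :, :g]       # group 0: shift down
    out[:-1, :, g:2*g]   = x[1:, :, g:2*g]     # group 1: shift up
    out[:, 1:, 2*g:3*g]  = x[:, :-1, 2*g:3*g]  # group 2: shift right
    out[:, :-1, 3*g:4*g] = x[:, 1:, 3*g:4*g]   # group 3: shift left
    return out
```

The shift itself has no parameters, which is how the model stays cheaper than attention while still exchanging spatial information.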
• Computer Science
BMVC
• 2021
A Circulant Channel-Specific (CCS) token-mixing MLP is proposed, which is spatially invariant and channel-specific; it takes fewer parameters but achieves higher classification accuracy on the ImageNet-1K benchmark.
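A circulant token-mixing matrix can be sketched with `np.roll`; the function name is ours, and "channel-specific" means one such kernel would be learned per channel (a single channel is shown here):

```python
import numpy as np

def circulant_mix(tokens, kernel):
    # Circulant token mixing: output i is sum_j kernel[(i - j) % n] * tokens[j],
    # i.e. every token applies the same learned weighting to its (circularly
    # indexed) neighbours. Spatially invariant, and only n parameters per
    # channel instead of the n * n of a dense token-mixing matrix.
    n = tokens.shape[0]
    return sum(kernel[s] * np.roll(tokens, s, axis=0) for s in range(n))
```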
• Computer Science
ArXiv
• 2021
This work proposes a new unpaired image-to-image translation model called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms and achieves competitive results when compared to prior convolutional-based methods.
• Computer Science
NeurIPS
• 2021
The Global Filter Network is presented, a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity and can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
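The frequency-domain mixing can be reduced to one line; this is a sketch of the idea for a single-channel map, with an assumed filter shape, not the network's actual layer:

```python
import numpy as np

def global_filter(x, K):
    # Multiply the 2D FFT of feature map x (H, W) by a learnable
    # frequency-domain filter K of the same shape, then invert.
    # Equivalent to a circular global convolution, at O(HW log HW) cost
    # rather than the quadratic cost of self-attention.
    return np.fft.ifft2(np.fft.fft2(x) * K).real
```

With `K` all ones the layer is the identity; learning `K` lets the model weight arbitrary spatial frequencies, which is how it captures long-range dependencies.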
• Computer Science
ArXiv
• 2021
The CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions a la Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, is presented.
• Computer Science
ArXiv
• 2021
A systematic empirical study finds that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data.
• Computer Science
Computational Visual Media
• 2021
A brief history of learning architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers, and the views on challenges and directions for new learning architectures are given, hoping to inspire future research.
• Computer Science
ICLR
• 2022
By promoting smoothness with a recently proposed sharpness-aware optimizer, this paper substantially improves the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning.
• Computer Science
NeurIPS
• 2021
This work proposes to add lateral connectivity to a locally connected network and allow learning via Hebbian plasticity, which enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.
• Computer Science
IEEE transactions on pattern analysis and machine intelligence
• 2022
ResMLP is a simple residual network that alternates a linear layer in which image patches interact, independently and identically across channels, with a two-layer feed-forward network in which channels interact independently per patch; it attains surprisingly good accuracy/complexity trade-offs on ImageNet.

References

Showing 1-10 of 62 references

• Computer Science
ICML
• 2016
This paper proposes a novel, simple yet effective activation scheme called concatenated ReLU (CReLU), theoretically analyzes its reconstruction property in CNNs, and integrates CReLU into several state-of-the-art CNN architectures, demonstrating improved recognition performance on the CIFAR-10/100 and ImageNet datasets with fewer trainable parameters.
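The concatenated ReLU itself is a one-liner; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def crelu(x):
    # Concatenate ReLU(x) and ReLU(-x) along the channel axis, preserving
    # both positive and negative phase information while doubling the
    # channel count.
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=-1)
```

For example, `crelu` applied to `[1.0, -2.0]` yields `[1.0, 0.0, 0.0, 2.0]`: the negative activation is not discarded, as it would be by a plain ReLU, but routed to a second channel.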
• Computer Science
Commun. ACM
• 2012
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
• Computer Science
ICLR
• 2021
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
• Computer Science
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
• 2021
It is found that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations, and Transformers are robust to the removal of almost any single layer.
• Computer Science
IEEE Transactions on Pattern Analysis and Machine Intelligence
• 2020
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
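The squeeze-recalibrate pattern can be sketched as follows; weight shapes and the reduction ratio are assumptions of ours, not values from the paper:

```python
import numpy as np

def se_block(x, w1, w2):
    # Squeeze: global average pool each channel of x (H, W, C) -> (C,).
    z = x.mean(axis=(0, 1))
    # Excitation: bottleneck MLP with a sigmoid gate;
    # w1 has shape (C, C // r), w2 has shape (C // r, C) for reduction ratio r.
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0) @ w2)))
    # Recalibrate: scale every channel by its gate value in (0, 1).
    return x * s
```

Because the gate is in (0, 1), the block can only attenuate channels, never amplify them, which makes it a cheap, differentiable form of channel attention.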
• Computer Science
ICLR
• 2020
This work proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
• Computer Science
ICML
• 2021
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
• Computer Science
ICLR
• 2015
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
This work proposes $\beta$-LASSO, a simple variant of the LASSO algorithm that, when applied to fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets.
• Computer Science
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2016
This work explores ways to scale up networks that utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.