Corpus ID: 234094263

ResMLP: Feedforward networks for image classification with data-efficient training

@article{Touvron2021ResMLPFN,
  title={ResMLP: Feedforward networks for image classification with data-efficient training},
  author={Hugo Touvron and Piotr Bojanowski and Mathilde Caron and Matthieu Cord and Alaaeldin El-Nouby and Edouard Grave and Gautier Izacard and Armand Joulin and Gabriel Synnaeve and Jakob Verbeek and Herv{\'e} J{\'e}gou},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.03404}
}
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity… 
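The abstract's two alternating sublayers can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' code: the paper's Affine normalization and layer-scale parameters are omitted, and all names (`ResMLPBlock`, `W_patch`, etc.) are invented for this sketch.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class ResMLPBlock:
    """One residual block: (i) a linear layer mixing patches, applied
    identically across channels, then (ii) a two-layer feed-forward
    network mixing channels independently per patch."""

    def __init__(self, num_patches, channels, hidden, rng):
        self.W_patch = rng.standard_normal((num_patches, num_patches)) * 0.02
        self.W1 = rng.standard_normal((channels, hidden)) * 0.02
        self.W2 = rng.standard_normal((hidden, channels)) * 0.02

    def __call__(self, x):
        # x: (num_patches, channels)
        # (i) cross-patch linear layer, shared across channels
        x = x + self.W_patch @ x
        # (ii) per-patch two-layer feed-forward network over channels
        x = x + gelu(x @ self.W1) @ self.W2
        return x

rng = np.random.default_rng(0)
block = ResMLPBlock(num_patches=16, channels=8, hidden=32, rng=rng)
x = rng.standard_normal((16, 8))
y = block(x)  # shape preserved: (16, 8)
```

Note that both sublayers are plain matrix multiplications; the only nonlinearity sits inside the channel-mixing MLP, which is what makes the architecture "entirely multi-layer perceptrons".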
An Image Patch is a Wave: Phase-Aware Vision MLP
TLDR
A novel Wave-MLP architecture is established based on the wave-like token representation that is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.
MetaFormer is Actually What You Need for Vision
  • Weihao Yu, Mi Luo, +5 authors Shuicheng Yan
  • Computer Science
    ArXiv
  • 2021
TLDR
It is argued that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks, and calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules.
PatchCleanser: Certifiably Robust Defense against Adversarial Patches for Any Image Classifier
TLDR
PatchCleanser is proposed as a certifiably robust defense against adversarial patches that is compatible with any image classifier and it is proved that the defense will always make correct predictions on certain images against any adaptive white-box attacker within the authors' threat model, achieving certified robustness.
Can Attention Enable MLPs To Catch Up With CNNs?
TLDR
A brief history of learning architectures is given, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers, along with views on challenges and directions for new learning architectures, in the hope of inspiring future research.
Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation
TLDR
A comprehensive analysis of GLNN shows when and why GLNNs can achieve competitive results to GNNs and suggests GLNN as a handy choice for latency-constrained applications.
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
TLDR
It is indicated that MLP-based models have the potential to replace CNNs by adopting inductive bias, and the proposed model, named RaftMLP, has a good balance of computational complexity, number of parameters, and actual memory usage.
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality
  • Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Jungong Han, Guiguang Ding
  • Computer Science
    ArXiv
  • 2021
TLDR
A methodology to incorporate local priors into an FC layer by merging the trained parameters of a parallel conv kernel into the FC kernel, and a novel architecture named RepMLPNet, which uses three FC layers to extract features and is the first MLP that seamlessly transfers to Cityscapes semantic segmentation.
Rethinking Token-Mixing MLP for MLP-based Vision Backbone
TLDR
It is discovered that token-mixing MLPs in existing MLP-based backbones are spatial-specific, and thus sensitive to spatial translation, and an improved structure is proposed, termed Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
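The spatial invariance claimed for the circulant token-mixing idea can be illustrated with a small NumPy sketch. This is not the paper's implementation; the shapes and names here are invented, and the check below only demonstrates the general property that a circulant mixing matrix commutes with cyclic shifts of the token sequence.

```python
import numpy as np

def circulant(v):
    # Build a circulant matrix whose i-th row is v cyclically shifted by i
    return np.stack([np.roll(v, i) for i in range(len(v))])

rng = np.random.default_rng(0)
N, C = 8, 4                                  # tokens, channels
x = rng.standard_normal((N, C))
w = rng.standard_normal((C, N)) * 0.1        # one weight vector per channel

# Channel-specific circulant mixing: each channel gets its own matrix
mixed = np.stack([circulant(w[c]) @ x[:, c] for c in range(C)], axis=1)

# Shift equivariance: mixing a cyclically shifted input equals
# cyclically shifting the mixed output
x_rolled = np.roll(x, 1, axis=0)
mixed_rolled = np.stack([circulant(w[c]) @ x_rolled[:, c] for c in range(C)], axis=1)
assert np.allclose(mixed_rolled, np.roll(mixed, 1, axis=0))
```

A dense (non-circulant) token-mixing matrix would fail this check, which is the "spatial-specific" weakness the TLDR refers to.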
Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
TLDR
Vision Permutator is presented, a conceptually simple and data efficient MLP-like architecture for visual recognition that separately encodes the feature representations along the height and width dimensions with linear projections to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction.
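The idea of encoding height and width separately with linear projections can be sketched as follows. This is a loose illustration under assumed shapes, not Vision Permutator's actual Permute-MLP (which splits channels into segments before mixing); all names here are hypothetical.

```python
import numpy as np

class AxisMixSketch:
    """Mix features along height and width with separate linear
    projections, capturing long-range dependencies along one spatial
    direction while leaving the other direction untouched."""

    def __init__(self, H, W, rng):
        self.Wh = rng.standard_normal((H, H)) * 0.02  # mixes along height
        self.Ww = rng.standard_normal((W, W)) * 0.02  # mixes along width

    def __call__(self, x):
        # x: (H, W, C)
        h = np.einsum('ij,jwc->iwc', self.Wh, x)  # long-range along height only
        w = np.einsum('ij,hjc->hic', self.Ww, x)  # long-range along width only
        return x + h + w  # residual combination (the real model recombines segments)

rng = np.random.default_rng(0)
mix = AxisMixSketch(H=4, W=6, rng=rng)
x = rng.standard_normal((4, 6, 3))
y = mix(x)  # shape preserved: (4, 6, 3)
```

Because each projection touches only one axis, positional information along the orthogonal axis passes through unchanged, which is the "preserve precise positional information" property the summary describes.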
Are we ready for a new paradigm shift? A Survey on Visual Deep MLP
TLDR
The investigation shows that MLPs' resolution-sensitivity and computational densities remain unresolved and that pure MLPs are gradually evolving towards CNN-like designs, and it is suggested that the current data volume and computational power are not ready to embrace pure MLPs, and that artificial visual guidance remains important.

References

SHOWING 1-10 OF 71 REFERENCES
ImageNet Large Scale Visual Recognition Challenge
TLDR
The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and state-of-the-art computer vision accuracy is compared with human accuracy.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization of such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper questions whether self-supervised learning provides new properties to the Vision Transformer (ViT) that stand out compared to convolutional networks (convnets), and introduces DINO, a simple self-supervised method based on self-distillation with no labels.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Differentiable Model Compression via Pseudo Quantization Noise
TLDR
DIFFQ is a differentiable method for quantizing model parameters without gradient approximations (e.g., the Straight-Through Estimator), and it outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
TLDR
This short report replaces the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension, and results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.