ResMLP: Feedforward networks for image classification with data-efficient training

  • Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet.
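The alternating patch-mixing/channel-mixing structure described above can be sketched shape-wise. This is an illustrative NumPy sketch under assumed toy dimensions, not the reference implementation (the paper additionally uses Affine normalization, GELU, and LayerScale, all omitted here):

```python
import numpy as np

def resmlp_block(x, W_patch, W1, W2):
    """One ResMLP residual block (simplified sketch).

    x: (num_patches, dim) activations for one image.
    W_patch: (num_patches, num_patches) cross-patch linear layer,
             applied identically to every channel.
    W1, W2: per-patch two-layer feed-forward weights.
    """
    # (i) cross-patch sublayer: patches interact, channels stay independent
    x = x + W_patch @ x
    # (ii) cross-channel sublayer: channels interact, patches stay independent
    h = np.maximum(x @ W1, 0.0)   # ReLU here for brevity; the paper uses GELU
    return x + h @ W2

rng = np.random.default_rng(0)
N, D = 16, 8                      # 16 patches, 8 channels (toy sizes)
x = rng.normal(size=(N, D))
out = resmlp_block(x,
                   rng.normal(size=(N, N)) * 0.1,
                   rng.normal(size=(D, 4 * D)) * 0.1,
                   rng.normal(size=(4 * D, D)) * 0.1)
print(out.shape)  # (16, 8)
```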

EfficientFormer: Vision Transformers at MobileNet Speed

This work shows that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance. A latency-driven analysis of ViT-based architectures and the experimental results validate the claim that powerful vision transformers can achieve ultra-fast inference speed on the edge.

An Image Patch is a Wave: Quantum Inspired Vision MLP (published at CVPR 2022 as "An Image Patch is a Wave: Phase-Aware Vision MLP")

  • Yehui Tang, Kai Han, Yunhe Wang
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.

MetaFormer is Actually What You Need for Vision

It is argued that the general MetaFormer architecture, rather than any specific token mixer, is the key player behind the superior results of recent transformer and MLP-like models on vision tasks, and the paper calls for future research dedicated to improving MetaFormer instead of focusing on the token-mixer modules.
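The abstraction above treats the token mixer as a pluggable component inside an otherwise fixed block. A minimal sketch, with a pooling-style mixer standing in for attention (dimensions and the exact pooling form are assumptions; normalization and LayerScale are omitted):

```python
import numpy as np

def metaformer_block(x, token_mixer, W1, W2):
    """Generic MetaFormer block: token mixing, then a channel MLP,
    each wrapped in a residual connection. x: (num_tokens, dim)."""
    x = x + token_mixer(x)                   # attention, MLP, pooling, ...
    h = np.maximum(x @ W1, 0.0)              # channel MLP
    return x + h @ W2

def pool_mixer(x):
    """PoolFormer-style mixer (illustrative): average over tokens,
    minus the identity so the residual branch adds only the pooled part."""
    return x.mean(axis=0, keepdims=True) - x

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = metaformer_block(x, pool_mixer,
                       rng.normal(size=(8, 32)) * 0.1,
                       rng.normal(size=(32, 8)) * 0.1)
print(out.shape)  # (16, 8)
```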

PatchCleanser: Certifiably Robust Defense against Adversarial Patches for Any Image Classifier

It is proved that PatchCleanser will always predict the correct class labels on certain images against any adaptive white-box attacker within the authors' threat model, achieving certified robustness.

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

This work presents Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. It separately encodes feature representations along the height and width dimensions with linear projections, capturing long-range dependencies while avoiding the attention mechanism of transformers.
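The separate height/width/channel encoding can be sketched as three independent linear mixings that are then summed. This is a simplification under assumed toy shapes; the actual Permute-MLP permutes channel segments with the spatial axes rather than projecting the raw H and W axes directly:

```python
import numpy as np

def permute_mlp(x, Wh, Ww, Wc):
    """Simplified Vision-Permutator-style mixing.

    x: (H, W, C) feature map; height, width and channel information are
    mixed by separate linear projections, then summed."""
    xh = np.einsum('hwc,hg->gwc', x, Wh)   # mix along height, Wh: (H, H)
    xw = np.einsum('hwc,wg->hgc', x, Ww)   # mix along width,  Ww: (W, W)
    xc = x @ Wc                            # mix along channels, Wc: (C, C)
    return xh + xw + xc

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
out = permute_mlp(x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)),
                  rng.normal(size=(8, 8)))
print(out.shape)  # (4, 4, 8)
```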

Rethinking Out-of-Distribution Detection From a Human-Centric Perspective

It is shown that a simple baseline OOD detection method can achieve comparable and even better performance than recently proposed methods, suggesting that progress in OOD detection in recent years may be overestimated.

EurNet: Efficient Multi-Range Relational Modeling of Spatial Multi-Relational Data

The results demonstrate the strength of EurNet at modeling spatial multi-relational data from various domains; the vision experiments follow the augmentation functions and mixup strategies used in Swin Transformer.

Assaying Out-Of-Distribution Generalization in Transfer Learning

A unified view of previous work is taken, highlighting message discrepancies that are addressed empirically, and recommendations are given on how to measure a model's robustness and how to improve it, providing broader insight into the sometimes contradictory statements on OOD robustness in prior research.

Which models are innately best at uncertainty estimation?

Strong empirical evidence is provided that distillation-based training regimes consistently yield better uncertainty estimates than other training schemes such as vanilla training, pretraining on a larger dataset, and adversarial training, and that ViT is by far the strongest architecture for uncertainty estimation, by any measure, in both in-distribution and class-out-of-distribution scenarios.

ImageNet Large Scale Visual Recognition Challenge

The creation of this benchmark dataset and the advances in object recognition it has enabled are described, and state-of-the-art computer vision accuracy is compared with human accuracy.

Going deeper with Image Transformers

This work builds and optimizes deeper transformer networks for image classification and investigates the interplay between the architecture and optimization of such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.

Emerging Properties in Self-Supervised Vision Transformers

This paper asks whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
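The "16x16 words" in the title refers to splitting the image into non-overlapping patches that are flattened into token vectors. A minimal sketch of that patchification step (the learned linear embedding that follows it is omitted):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    as in ViT. H and W must be divisible by `patch`."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * C)     # (num_patches, p*p*C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 14*14 patches of 16*16*3 values each
```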

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
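The attention mechanism at the core of the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. A self-contained NumPy sketch (single head, toy sizes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, the Transformer's core operation."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```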

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
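The framework's central idea is that each block learns a residual F(x) added to a skip connection, so the identity mapping is trivially representable. A minimal sketch:

```python
import numpy as np

def residual_block(x, W1, W2):
    """A plain residual unit: output = x + F(x). The skip connection makes
    the identity mapping trivial to represent, which is what eases the
    optimization of very deep stacks."""
    h = np.maximum(x @ W1, 0.0)   # first layer + ReLU
    return x + h @ W2             # residual addition

# With zero weights the block is exactly the identity, so stacking many
# blocks cannot degrade the representation below a shallower network's.
x = np.arange(4.0)
out = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
print(out)  # [0. 1. 2. 3.]
```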

Pay Attention to MLPs

This work proposes a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and shows that it can perform as well as Transformers in key language and vision applications and can scale as much as Transformers over increased data and compute.
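The gating referred to above is gMLP's Spatial Gating Unit: half of the channels are mixed across tokens and used to gate the other half. A simplified sketch (normalization before the spatial projection is omitted; shapes are assumed toy sizes):

```python
import numpy as np

def spatial_gating_unit(z, W_s, b_s):
    """gMLP's Spatial Gating Unit (simplified): split channels in half,
    mix one half across tokens with a linear layer, and use it to gate
    the other half elementwise. z: (num_tokens, 2d)."""
    u, v = np.split(z, 2, axis=-1)
    v = W_s @ v + b_s              # spatial (cross-token) projection
    return u * v                   # elementwise gating

rng = np.random.default_rng(0)
n, d = 16, 8
z = rng.normal(size=(n, 2 * d))
# W_s initialized near identity and b_s near zero, as a stand-in for the
# paper's near-identity initialization of the spatial projection.
out = spatial_gating_unit(z, np.eye(n), np.zeros((n, 1)))
print(out.shape)  # (16, 8)
```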

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

This short report replaces the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension, and results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.
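Applying a feed-forward layer "over the patch dimension" amounts to running the usual per-token MLP on the transposed token matrix. An illustrative sketch (toy sizes; normalization and residuals omitted):

```python
import numpy as np

def ff_over_patches(x, W1, W2):
    """Feed-forward layer applied across patches rather than channels.
    x: (num_patches, dim); W1, W2 act on the patch axis."""
    xt = x.T                        # (dim, num_patches)
    h = np.maximum(xt @ W1, 0.0)    # hidden layer mixes patch positions
    return (h @ W2).T               # back to (num_patches, dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = ff_over_patches(x, rng.normal(size=(16, 64)) * 0.1,
                      rng.normal(size=(64, 16)) * 0.1)
print(out.shape)  # (16, 8)
```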

RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

It is highlighted that combining the global representational capacity and positional perception of fully-connected (FC) layers with the local prior of convolution can improve the performance of neural networks, with faster speed, on both tasks with translation invariance and tasks with aligned images and positional patterns.
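The re-parameterization rests on the fact that a convolution is a special (Toeplitz-structured) FC layer, so a conv branch can be folded into an FC layer at inference time. A 1D sketch verifying the equivalence (toy sizes; the paper works with 2D convolutions):

```python
import numpy as np

def conv1d_as_fc(kernel, n):
    """Build the FC (Toeplitz) matrix equivalent to a zero-padded 1D
    cross-correlation, illustrating why conv and FC branches can be
    merged into a single FC layer at inference time."""
    k, pad = len(kernel), len(kernel) // 2
    M = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - pad
            if 0 <= col < n:
                M[i, col] = w
    return M

kernel = np.array([1.0, 2.0, 1.0])
x = np.arange(5.0)
M = conv1d_as_fc(kernel, 5)
# np.convolve flips the kernel, so flip it back to get cross-correlation
direct = np.convolve(x, kernel[::-1], mode='same')
print(np.allclose(M @ x, direct))  # True
```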