Corpus ID: 235694359

Global Filter Networks for Image Classification

@inproceedings{Rao2021GlobalFN,
  title={Global Filter Networks for Image Classification},
  author={Yongming Rao and Wenliang Zhao and Zheng Zhu and Jiwen Lu and Jie Zhou},
  booktitle={NeurIPS},
  year={2021}
}
Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interactions among spatial locations from raw data. The complexity of self-attention and MLPs grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter… 
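The abstract describes mixing spatial tokens in the Fourier domain instead of with quadratic-cost self-attention. A minimal sketch of that idea, assuming an elementwise multiplication between the 2D FFT of the features and a learnable frequency-domain filter (function and variable names here are illustrative, not the paper's code):

```python
import numpy as np

def global_filter(x, K):
    """Sketch of a global filter layer.

    x: (H, W, C) real-valued spatial features.
    K: (H, W//2 + 1, C) complex learnable filter in the frequency domain.
    """
    X = np.fft.rfft2(x, axes=(0, 1))   # 2D FFT over the spatial dims
    X = X * K                          # elementwise learnable global filter
    # Inverse FFT brings the mixed tokens back to the spatial domain.
    return np.fft.irfft2(X, s=x.shape[:2], axes=(0, 1))

H, W, C = 8, 8, 4
x = np.random.randn(H, W, C)
K = np.ones((H, W // 2 + 1, C), dtype=complex)  # all-ones filter = identity
y = global_filter(x, K)
print(np.allclose(y, x))  # True: the identity filter leaves features unchanged
```

Because the FFT costs O(HW log HW), this mixing scales log-linearly with the number of tokens rather than quadratically, which is the scaling advantage the abstract alludes to.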
MAXIM: Multi-Axis MLP for Image Processing
TLDR
The proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models.
Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
TLDR
To exploit both global and local dependencies without self-attention, this paper presents Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting.
An Image Patch is a Wave: Phase-Aware Vision MLP
TLDR
Based on the wave-like token representation, a novel WaveMLP architecture is established that is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.
Sequencer: Deep LSTM for Image Classification
TLDR
This work proposes Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on what inductive bias is suitable for computer vision, and models long-range dependencies using LSTMs rather than self-attention layers.
An Image Patch is a Wave: Quantum Inspired Vision MLP
TLDR
Extensive experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art MLP architectures on various vision tasks such as image classification, object detection and semantic segmentation.
S2-MLP: Spatial-Shift MLP Architecture for Vision
TLDR
This paper proposes a novel pure MLP architecture, spatial-shift MLP (S2-MLP), which accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.
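The spatial-shift operation summarized above can be sketched as follows: split the channels into four groups and shift each group one pixel in a different spatial direction, so that a subsequent channel-mixing MLP sees neighboring tokens. This is a simplified illustration of the operation as described, not the authors' implementation; the surrounding channel MLPs are omitted:

```python
import numpy as np

def spatial_shift(x):
    """x: (H, W, C) features, C divisible by 4.

    Each quarter of the channels is shifted one pixel in one of the
    four directions; border positions keep their original values.
    """
    H, W, C = x.shape
    g = C // 4
    out = x.copy()
    out[1:, :, 0 * g:1 * g] = x[:-1, :, 0 * g:1 * g]  # shift down
    out[:-1, :, 1 * g:2 * g] = x[1:, :, 1 * g:2 * g]  # shift up
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]  # shift right
    out[:, :-1, 3 * g:4 * g] = x[:, 1:, 3 * g:4 * g]  # shift left
    return out
```

The shift itself has no parameters and costs O(HWC), which is why the architecture can rely on plain channel MLPs for all learned mixing.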
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
TLDR
The Recursive Gated Convolution (gnConv) is presented, which performs high-order spatial interactions with gated convolutions and recursive designs, and is demonstrated to be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs.
S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
TLDR
This paper improves the S2-MLP vision backbone by expanding the feature map along the channel dimension, splitting the expanded feature map into several parts, and exploiting the split-attention operation to fuse these parts.
FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization
TLDR
This work proposes a novel frequency-aware MLP architecture in which domain-specific features are filtered out in the transformed frequency domain, augmenting the invariant descriptor for label prediction, and is the first to propose an MLP-like backbone for domain generalization.
Hire-MLP: Vision MLP via Hierarchical Rearrangement
TLDR
Hire-MLP is presented, a simple yet competitive vision MLP architecture via Hierarchical rearrangement, which contains two levels of rearrangements to enable information communication between different regions and captures global context by circularly shifting all tokens along spatial directions.

References

Showing 1-10 of 62 references
MLP-Mixer: An all-MLP Architecture for Vision
TLDR
It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
Fast Fourier Convolution
TLDR
Fast Fourier convolution (FFC) is a generic operator that can directly replace vanilla convolutions in a large body of existing networks, without any adjustments and with comparable complexity metrics (e.g., FLOPs).
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Aggregated Residual Transformations for Deep Neural Networks
TLDR
On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when model capacity is increased.
ResMLP: Feedforward networks for image classification with data-efficient training
TLDR
ResMLP is a simple residual network that alternates a linear layer in which image patches interact, independently and identically across channels, and a two-layer feed-forward network in which channels interact independently per patch.
Improve Vision Transformers Training by Suppressing Over-smoothing
TLDR
This work investigates how to stabilize the training of vision transformers without special structural modification, and proposes a number of techniques to alleviate the over-smoothing problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate different patches via an additional patch classification loss for CutMix.
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
TLDR
The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification, investigates the interplay of architecture and optimization of such dedicated transformers, and makes two architecture changes that significantly improve the accuracy of deep transformers.