• Corpus ID: 236154781

CycleMLP: A MLP-like Architecture for Dense Prediction

@article{Chen2022CycleMLPAM,
  title={CycleMLP: A MLP-like Architecture for Dense Prediction},
  author={Shoufa Chen and Enze Xie and Chongjian Ge and Ding Liang and Ping Luo},
  journal={ArXiv},
  year={2022},
  volume={abs/2107.10224}
}
This paper presents CycleMLP, a simple MLP-like architecture that serves as a versatile backbone for visual recognition and dense prediction. Unlike modern MLP architectures such as MLP-Mixer [49], ResMLP [50], and gMLP [35], whose designs are tied to the input image size and are therefore infeasible for object detection and segmentation, CycleMLP has two advantages over these approaches: (1) it can cope with various image sizes, and (2) it achieves computational complexity linear in image size by… 
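
As a rough, hypothetical sketch of how an MLP-style layer can stay usable at arbitrary image sizes with cost linear in the number of pixels (not the paper's actual Cycle FC operator), the snippet below gathers a fixed set of cyclically shifted copies of the feature map and mixes them with a channel MLP; the class name and offsets are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CyclicShiftMLP(nn.Module):
        # Hypothetical layer: concatenate a fixed number of rolled copies of the
        # feature map, then mix with a channel MLP; cost grows linearly with H*W.
        def __init__(self, dim, offsets=((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))):
            super().__init__()
            self.offsets = offsets
            self.proj = nn.Linear(dim * len(offsets), dim)

        def forward(self, x):  # x: (B, H, W, C), any H and W
            views = [torch.roll(x, shifts=o, dims=(1, 2)) for o in self.offsets]
            return self.proj(torch.cat(views, dim=-1))

    layer = CyclicShiftMLP(64)
    print(layer(torch.randn(2, 14, 14, 64)).shape)   # (2, 14, 14, 64)
    print(layer(torch.randn(2, 32, 48, 64)).shape)   # works at a different spatial size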

Hire-MLP: Vision MLP via Hierarchical Rearrangement

Hire-MLP, a simple yet competitive vision MLP architecture built on hierarchical rearrangement, is presented; it contains two levels of rearrangement to enable information exchange between different regions and captures global context by circularly shifting all tokens along the spatial directions.

RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality

Locality Injection, a methodology for incorporating local priors into an FC layer by merging the trained parameters of a parallel conv kernel into the FC kernel, is presented, along with a novel architecture named RepMLPNet, which uses three FC layers to extract features and is the first MLP that seamlessly transfers to Cityscapes semantic segmentation.
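
A minimal sketch of the general "merge a conv into an FC" idea that this summary describes, assuming a square, same-padded conv without bias; pushing an identity basis through the conv yields the equivalent dense matrix, which could then be added to an FC weight of the same shape. This is an illustration of the principle, not the paper's exact Locality Injection procedure.

    import torch
    import torch.nn.functional as F

    def conv_as_fc_weight(conv_w, C, H, W):
        # Row j of the result is the flattened conv response to basis element j,
        # i.e. the (C*H*W, C*H*W) matrix that reproduces the conv as a matmul.
        eye = torch.eye(C * H * W).reshape(C * H * W, C, H, W)
        out = F.conv2d(eye, conv_w, padding=conv_w.shape[-1] // 2)
        return out.reshape(C * H * W, C * H * W)

    C, H, W = 2, 5, 5
    conv_w = torch.randn(C, C, 3, 3)
    x = torch.randn(1, C, H, W)
    M = conv_as_fc_weight(conv_w, C, H, W)
    fc_out = (x.reshape(1, -1) @ M).reshape(1, C, H, W)      # conv expressed as an FC
    print(torch.allclose(fc_out, F.conv2d(x, conv_w, padding=1), atol=1e-5))  # True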

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, based on dynamic information fusion, and proposes a procedure that dynamically generates mixing matrices from the contents of all the tokens to be mixed.
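
A hedged sketch of content-dependent token mixing in the spirit of this summary: a small projection compresses the tokens, a linear layer predicts an N x N mixing matrix from them, and the matrix is applied to the tokens. The shapes, reduction size, and softmax normalization are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class DynamicMixer(nn.Module):
        def __init__(self, dim, num_tokens, reduced=8):
            super().__init__()
            self.reduce = nn.Linear(dim, reduced)               # compress each token
            self.to_mix = nn.Linear(num_tokens * reduced, num_tokens * num_tokens)

        def forward(self, x):                                   # x: (B, N, C)
            B, N, C = x.shape
            mix = self.to_mix(self.reduce(x).reshape(B, -1))    # predict N*N mixing weights
            mix = mix.reshape(B, N, N).softmax(dim=-1)          # row-normalized mixing matrix
            return mix @ x                                      # content-dependent token mixing

    print(DynamicMixer(dim=64, num_tokens=16)(torch.randn(2, 16, 64)).shape)  # (2, 16, 64)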

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

A novel MorphMLP architecture is proposed that captures local details in the low-level layers and gradually shifts toward long-term modeling in the high-level layers; it can be as powerful as, and even outperform, self-attention-based models.

AS-MLP: An Axial Shifted MLP Architecture for Vision

An Axial Shifted MLP architecture (AS-MLP) is presented; it is the first MLP-based architecture applied to downstream tasks and achieves performance competitive with transformer-based architectures even with slightly lower FLOPs.
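
A hypothetical illustration of an axial shift of the kind the title suggests (not the paper's exact formulation): channel groups are rolled by different amounts along one spatial axis, after which a pointwise projection mixes the displaced features.

    import torch
    import torch.nn as nn

    def axial_shift(x, axis, shifts=(-2, -1, 0, 1, 2)):
        # x: (B, C, H, W); each channel chunk is rolled by its own offset along `axis`
        chunks = torch.chunk(x, len(shifts), dim=1)
        return torch.cat([torch.roll(c, s, dims=axis) for c, s in zip(chunks, shifts)], dim=1)

    x = torch.randn(1, 40, 7, 7)
    shifted = axial_shift(x, axis=2) + axial_shift(x, axis=3)   # height- and width-wise shifts
    out = nn.Conv2d(40, 40, kernel_size=1)(shifted)             # pointwise (channel) mixing
    print(out.shape)                                            # torch.Size([1, 40, 7, 7])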

MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

MLP-3D networks are presented: a novel MLP-like 3D architecture for video recognition that relies on neither convolutions nor attention mechanisms and achieves 68.5%/81.4% top-1 accuracy on the Something-Something V2 and Kinetics-400 datasets, respectively.

MDMLP: Image Classification from Scratch on Small Datasets with MLP

A conceptually simple and lightweight MLP-based architecture that nevertheless achieves SOTA when trained from scratch on small datasets is presented, together with a novel and efficient MLP-based attention mechanism that highlights objects in images, indicating its explanatory power.

RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

The small model, RaftMLP-S, is comparable to state-of-the-art global MLP-based models in parameter count and efficiency per computation, and the problem of fixed input image resolution for global MLP-based models is tackled by using bicubic interpolation.
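
To illustrate the resolution-adaptation trick mentioned above, here is a toy example (with assumed shapes, not the paper's code) of resizing a token-mixing weight matrix trained for 14x14 tokens to a new token count via bicubic interpolation.

    import torch
    import torch.nn.functional as F

    w_train = torch.randn(196, 196)            # token-mixing weights for 14*14 = 196 tokens
    n_new = 16 * 16                            # token count at the new input resolution
    w_new = F.interpolate(w_train[None, None], # add batch/channel dims for interpolate
                          size=(n_new, n_new), mode='bicubic', align_corners=False)[0, 0]
    print(w_new.shape)                         # torch.Size([256, 256])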

S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

This paper improves the S2-MLP vision backbone by expanding the feature map along the channel dimension, splitting the expanded feature map into several parts, and exploiting the split-attention operation to fuse these parts.
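
A minimal sketch of split-attention-style fusion as described above, assuming k parallel branches of shape (B, N, C): each branch is globally pooled, a small gating layer scores the branches, and a softmax across branches weights their sum. The gating layer and pooling choice are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SplitAttentionFuse(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(dim, dim)                     # illustrative gating MLP

        def forward(self, branches):                            # list of k tensors, each (B, N, C)
            stacked = torch.stack(branches, dim=1)              # (B, k, N, C)
            pooled = stacked.mean(dim=2)                        # (B, k, C) global context per branch
            weights = self.gate(pooled).softmax(dim=1)          # compete across the k branches
            return (stacked * weights.unsqueeze(2)).sum(dim=1)  # (B, N, C)

    parts = [torch.randn(2, 49, 32) for _ in range(3)]
    print(SplitAttentionFuse(32)(parts).shape)                  # torch.Size([2, 49, 32])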

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

This paper clusters and then aggregates key and value tokens, as a content-based method of reducing the total token count, and extends the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks.
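
As a rough sketch of the token-reduction idea (using simple strided grouping in place of the paper's content-based clustering): keys and values are pooled into a small set of cluster summaries, and every query attends only to those summaries instead of all N tokens.

    import torch

    def clustered_attention(q, k, v, num_clusters=16):
        # q, k, v: (B, N, C); group keys/values into clusters and average each group
        B, N, C = k.shape
        k_c = k.reshape(B, num_clusters, N // num_clusters, C).mean(dim=2)
        v_c = v.reshape(B, num_clusters, N // num_clusters, C).mean(dim=2)
        attn = (q @ k_c.transpose(1, 2) / C ** 0.5).softmax(dim=-1)   # (B, N, num_clusters)
        return attn @ v_c                                             # (B, N, C)

    x = torch.randn(2, 64, 32)
    print(clustered_attention(x, x, x).shape)                         # torch.Size([2, 64, 32])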
...

References

Showing 1-10 of 91 references

AS-MLP: An Axial Shifted MLP Architecture for Vision

An Axial Shifted MLP architecture (AS-MLP) is presented; it is the first MLP-based architecture applied to downstream tasks and achieves performance competitive with transformer-based architectures even with slightly lower FLOPs.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither is necessary; MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
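
A compact, hedged sketch of the two-MLP mixing pattern this summary describes: one MLP mixes across tokens (acting per channel) and another mixes across channels (acting per token). Hidden sizes and normalization placement are assumptions; note that the token MLP is tied to a fixed token count, which is exactly the image-size dependence the CycleMLP abstract points out.

    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        def __init__(self, num_tokens, dim):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(nn.Linear(num_tokens, 2 * num_tokens), nn.GELU(),
                                           nn.Linear(2 * num_tokens, num_tokens))
            self.channel_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                             nn.Linear(4 * dim, dim))

        def forward(self, x):                                   # x: (B, N, C)
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            return x + self.channel_mlp(self.norm2(x))

    print(MixerBlock(num_tokens=196, dim=64)(torch.randn(2, 196, 64)).shape)  # (2, 196, 64)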

Bottleneck Transformers for Visual Recognition

BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation, is presented, along with a simple adaptation of the BoTNet design for image classification.

Aggregated Residual Transformations for Deep Neural Networks

On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintained complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient; its effectiveness is demonstrated by scaling up MobileNets and ResNet.
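
A tiny numeric sketch of compound scaling as summarized above: a single coefficient phi scales depth, width, and resolution through fixed per-dimension bases. The base values and constants below are made up for illustration, not the paper's searched coefficients.

    # Illustrative compound scaling: one coefficient phi drives all three dimensions.
    ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15            # assumed per-dimension scaling bases

    def compound_scale(phi, base_depth=18, base_width=64, base_res=224):
        return (round(base_depth * ALPHA ** phi),  # number of layers
                round(base_width * BETA ** phi),   # number of channels
                round(base_res * GAMMA ** phi))    # input resolution

    for phi in range(4):
        print(phi, compound_scale(phi))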

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that aim to use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting the Transformer to various dense prediction tasks; extensive experiments validate that it boosts the performance of many downstream tasks, including object detection and instance and semantic segmentation.

ResMLP: Feedforward networks for image classification with data-efficient training

ResMLP is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch; it attains surprisingly good accuracy/complexity trade-offs on ImageNet.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.

Global Filter Networks for Image Classification

The Global Filter Network is presented, a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity and can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
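
A hedged sketch of frequency-domain token mixing in the spirit of this summary: take a 2D FFT of the feature map, multiply elementwise by a learnable filter, and invert the transform; the FFT is what gives the log-linear cost mentioned above. The tensor layout and initialization below are assumptions.

    import torch
    import torch.nn as nn

    class GlobalFilterLayer(nn.Module):
        def __init__(self, dim, h, w):
            super().__init__()
            # learnable complex filter stored as a real tensor of shape (h, w//2+1, dim, 2)
            self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

        def forward(self, x):                                   # x: (B, H, W, C)
            freq = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
            freq = freq * torch.view_as_complex(self.filter)    # global, elementwise filtering
            return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2), norm='ortho')

    print(GlobalFilterLayer(64, 14, 14)(torch.randn(2, 14, 14, 64)).shape)  # (2, 14, 14, 64)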
...