Corpus ID: 237452623

ConvMLP: Hierarchical Convolutional MLPs for Vision

Authors: Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi

MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach results comparable to convolutional and transformer-based methods. However, most adopt spatial MLPs, which take fixed-dimension inputs, making it difficult to apply them to downstream tasks such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks and fully connected layers…
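The fixed-input limitation mentioned in the abstract can be illustrated with a minimal sketch. This is not the ConvMLP implementation; the helper names (`linear`, `channel_mlp`, `spatial_mlp`), the sizes, and the random weights are all assumptions chosen for illustration. A spatial (token-mixing) layer has a weight matrix whose shape is tied to the number of tokens, so it rejects a larger image, while a channel-wise layer is applied per token and accepts any token count.

```python
import random

def linear(weights, vec):
    """Dense layer without bias: weights is out_dim x in_dim, vec has in_dim entries."""
    if any(len(row) != len(vec) for row in weights):
        raise ValueError(f"layer expects {len(weights[0])} inputs, got {len(vec)}")
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def channel_mlp(tokens, w_channel):
    """Mix features within each token; independent of how many tokens there are."""
    return [linear(w_channel, tok) for tok in tokens]

def spatial_mlp(tokens, w_spatial):
    """Mix information across tokens, one channel at a time.
    The weight shape (tokens x tokens) fixes the accepted token count."""
    channels = [list(ch) for ch in zip(*tokens)]        # channels x tokens
    mixed = [linear(w_spatial, ch) for ch in channels]  # still channels x tokens
    return [list(tok) for tok in zip(*mixed)]           # back to tokens x channels

rng = random.Random(0)
C = 4          # channels per token (hypothetical size)
N_TRAIN = 16   # token count at the training resolution, e.g. a 4x4 patch grid

w_channel = random_matrix(C, C, rng)
w_spatial = random_matrix(N_TRAIN, N_TRAIN, rng)

small = random_matrix(16, C, rng)  # same token count as at training time
large = random_matrix(36, C, rng)  # larger input, e.g. a 6x6 patch grid

out_small = spatial_mlp(small, w_spatial)    # works: 16 tokens match the weights
out_channel = channel_mlp(large, w_channel)  # works: applied per token

try:
    spatial_mlp(large, w_spatial)
    spatial_accepts_large = True
except ValueError:
    spatial_accepts_large = False            # 36 tokens do not fit 16x16 weights
```

Dense prediction tasks such as detection and segmentation typically run at varying input resolutions, which is why a layer of the second kind is harder to reuse as a backbone there.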

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion, and proposes a procedure to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.

MDMLP: Image Classification from Scratch on Small Datasets with MLP

This paper presents a conceptually simple and lightweight MLP-based architecture that achieves SOTA when training from scratch on small datasets, together with a novel and efficient MLP-based attention mechanism that highlights objects in images, indicating its explanatory power.

UniNet: Unified Architecture Search with Convolution, Transformer, and MLP

This work studies the learnable combination of convolution, transformer, and MLP by proposing a novel unified architecture search approach, and proposes context-aware downsampling modules (DSMs) to mitigate the gap between the different types of operators.

RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

The small model, RaftMLP-S, is comparable to the state-of-the-art global MLP-based model in terms of parameters and efficiency per calculation, and the problem of fixed input image resolution for global MLP-based models is tackled by using bicubic interpolation.

Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

To exploit both global and local dependencies without self-attention, this paper presents Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with the amount of spatial shifting.

LKD-Net: Large Kernel Convolution Network for Single Image Dehazing

A novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition deep-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN) is devised in this paper.

SplitMixer: Fat Trimmed From MLP-like Models

It is shown, both theoretically and experimentally, that SplitMixer performs on par with the state-of-the-art MLP-like models while having a significantly lower number of parameters and FLOPS.

SWAT: Spatial Structure Within and Among Tokens

This paper argues that models can achieve significant gains when spatial structure is preserved during tokenization and explicitly used during the mixing stage. It proposes two key contributions, Structure-aware Tokenization and Structure-aware Mixing, both of which can be combined with existing models with minimal effort.

HrreNet: semantic segmentation network for moderate and high-resolution satellite images

A segmentation network, the High-resolution Resource Extracting Network (HrreNet), is proposed. It uses high-resolution feature representation, multi-scale context fusion, boundary refinement with relearning, and a structural similarity loss to improve performance on small-size objects while slightly improving performance on larger-size objects.



S2-MLP: Spatial-Shift MLP Architecture for Vision

This paper proposes a novel pure MLP architecture, spatial-shift MLP (S2-MLP), which achieves performance as strong as ViT on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.

AS-MLP: An Axial Shifted MLP Architecture for Vision

This paper proposes an Axial Shifted MLP architecture (AS-MLP), the first MLP-based architecture to be applied to downstream tasks, which achieves competitive performance compared to transformer-based architectures even with slightly lower FLOPs.

CycleMLP: A MLP-like Architecture for Dense Prediction

CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models, and expands the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Densely Connected Convolutional Networks

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.