Corpus ID: 234742218

Pay Attention to MLPs

@article{Liu2021PayAT,
  title={Pay Attention to MLPs},
  author={Hanxiao Liu and Zihang Dai and David R. So and Quoc V. Le},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.08050}
}
Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model…
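A minimal PyTorch sketch of the gMLP block the abstract describes, assuming the spatial gating formulation from the paper: the channels are split in two, one half is mixed across token positions by a single learned projection, and the result gates the other half elementwise. Class names and the near-identity initialization below are illustrative, not the authors' reference code.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SpatialGatingUnit(nn.Module):
    """Gating across the token (spatial) dimension: split channels, mix one
    half across positions, and use it to gate the other half elementwise."""
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.proj = nn.Linear(seq_len, seq_len)   # spatial (token-mixing) projection
        nn.init.zeros_(self.proj.weight)          # near-identity start: weights ~ 0,
        nn.init.ones_(self.proj.bias)             # bias = 1, so gating begins as identity

    def forward(self, x):                         # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)                 # split channels into two halves
        v = self.norm(v)
        v = self.proj(v.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return u * v                              # elementwise gating

class GMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.channel_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.channel_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):
        shortcut = x
        x = F.gelu(self.channel_in(self.norm(x)))
        x = self.sgu(x)
        return self.channel_out(x) + shortcut
```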
Citations

An Attention Free Transformer
Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot-product self-attention, is introduced; it demonstrates competitive performance on all the benchmarks while providing excellent efficiency at the same time.
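As a concrete reading of how AFT removes dot-product attention, a hedged sketch of the AFT-full variant: a learned pairwise position bias replaces query-key similarities, and a sigmoid of the query gates a weighted average of the values. Shapes and names are assumptions for illustration.

```python
import torch
from torch import nn

class AFTFull(nn.Module):
    """Sketch of AFT-full: position biases w[t, t'] stand in for query-key
    dot products; no attention matrix from Q.K is ever formed."""
    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.w = nn.Parameter(torch.zeros(seq_len, seq_len))  # learned pairwise biases

    def forward(self, x):                                  # x: (B, T, d)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        # logits[b, t, t', d] = K[b, t', d] + w[t, t'], normalized over t'
        logits = K.unsqueeze(1) + self.w.unsqueeze(0).unsqueeze(-1)
        weights = torch.softmax(logits, dim=2)             # (B, T, T, d); O(T^2 d) memory
        ctx = torch.einsum('bqkd,bkd->bqd', weights, V)    # weighted average of values
        return torch.sigmoid(Q) * ctx                      # query acts as a gate
```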
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
It is indicated that MLP-based models have the potential to replace CNNs by adopting inductive bias, and the proposed model, named RaftMLP, has a good balance of computational complexity, number of parameters, and actual memory usage.
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
The aim is to improve the models' data efficiency at training and generalization at inference; the approach substantially improves the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning.
Multi-Exit Vision Transformer for Dynamic Inference
This work proposes seven different architectures for early-exit branches that can be used for dynamic inference in Vision Transformer backbones, and shows that each of these architectures could prove useful in the trade-off between accuracy and speed.
Rethinking Token-Mixing MLP for MLP-based Vision Backbone
It is discovered that the token-mixing MLPs in existing MLP-based backbones are spatial-specific and thus sensitive to spatial translation; an improved structure, termed the Circulant Channel-Specific (CCS) token-mixing MLP, is proposed, which is spatial-invariant and channel-specific.
Global Filter Networks for Image Classification
The Global Filter Network is presented, a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity, and can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability, and robustness.
S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
  • Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
  • Computer Science
  • ArXiv
  • 2021
This paper improves the S2-MLP vision backbone by expanding the feature map along the channel dimension, splitting the expanded feature map into several parts, and exploiting the split-attention operation to fuse these parts.
S2-MLP: Spatial-Shift MLP Architecture for Vision
A novel pure-MLP architecture, spatial-shift MLP (S2-MLP), which achieves performance as excellent as ViT on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.
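A short sketch of the spatial-shift operation the summary refers to: channels are split into four groups, each shifted one pixel in a different direction, so a following channel-mixing MLP sees neighboring positions; the shift itself costs no FLOPs and has no parameters. The function name and the border handling (keeping original values at the edges) are my assumptions.

```python
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Illustrative spatial shift: four channel groups, each shifted one
    pixel in a different direction; border rows/columns keep their
    original values in this sketch."""
    b, h, w, c = x.shape
    g = c // 4
    out = x.clone()
    out[:, 1:, :, 0*g:1*g] = x[:, :-1, :, 0*g:1*g]   # shift down
    out[:, :-1, :, 1*g:2*g] = x[:, 1:, :, 1*g:2*g]   # shift up
    out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]   # shift right
    out[:, :, :-1, 3*g:4*g] = x[:, :, 1:, 3*g:4*g]   # shift left
    return out
```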
ConvMLP: Hierarchical Convolutional MLPs for Vision
MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach results comparable to convolutional and transformer-based methods.
Towards Biologically Plausible Convolutional Networks
This work proposes to add lateral connectivity to a locally connected network and allow learning via Hebbian plasticity, which enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.

References

Showing 1-10 of 46 references
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
CvT: Introducing Convolutions to Vision Transformers
A new architecture, named Convolutional vision Transformer (CvT), is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
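Since this is the architecture gMLP is positioned against, its core operation in a few lines: standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with the usual optional masking.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, the Transformer's core operation."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v            # weighted average of values
```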
Language Modeling with Gated Convolutional Networks
A finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach has been competitive with strong recurrent models on these large-scale language tasks.
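A sketch of the paper's gated linear unit over a causal convolution, h = (X*W) ⊗ σ(X*V); the left-padding for causality and the class name are my choices for illustration.

```python
import torch
from torch import nn

class GatedConv1d(nn.Module):
    """Sketch of a gated convolutional layer (GLU): h = (x*W) * sigmoid(x*V)."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1          # left-pad so the convolution is causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):                   # x: (batch, channels, seq_len)
        x = nn.functional.pad(x, (self.pad, 0))
        a, b = self.conv(x).chunk(2, dim=1) # linear path and gate path
        return a * torch.sigmoid(b)         # gating controls information flow
```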
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition…
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient; the effectiveness of this method is demonstrated by scaling up MobileNets and ResNet.
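The compound-scaling rule is compact enough to show directly: a single coefficient φ scales depth, width, and resolution together. The base values below are those reported for EfficientNet-B0; treat the snippet as a sketch.

```python
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1,
                     gamma: float = 1.15) -> tuple[float, float, float]:
    """Compound scaling: multiply depth by alpha**phi, width by beta**phi, and
    input resolution by gamma**phi; the base values satisfy
    alpha * beta**2 * gamma**2 ~ 2, so FLOPs grow roughly as 2**phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# e.g. phi = 1 scales a baseline's depth by 1.2x, width by 1.1x, resolution by 1.15x
```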
Pay Less Attention with Lightweight and Dynamic Convolutions
It is shown that a very lightweight convolution can perform competitively with the best reported self-attention results, and dynamic convolutions are introduced which are simpler and more efficient than self-attention.
Synthesizer: Rethinking Self-Attention in Transformer Models
The true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models is investigated, and a model that learns synthetic attention weights without token-token interactions, called Synthesizer, is proposed.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.