Corpus ID: 246276042

Convolutional Xformers for Vision

@article{Jeevan2022ConvolutionalXF,
  title={Convolutional Xformers for Vision},
  author={Pranav Jeevan and Amit Sethi},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.10271}
}
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture – Convolutional X-formers for Vision (CXV) – to…
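
The quadratic-versus-linear complexity argument can be made concrete with a small sketch. The code below (Python/PyTorch) is purely illustrative and is not the CXV implementation: it contrasts standard softmax attention, whose cost grows with the square of the number of patches, with a kernelized linear attention that exploits associativity. The ELU+1 feature map, shapes, and function names are assumptions.

```python
# Illustrative sketch only: contrasts quadratic softmax attention with a
# kernelized linear attention (shapes and feature map are assumptions,
# not the CXV implementation).
import torch

def softmax_attention(q, k, v):
    # (n, d) x (d, n) -> (n, n): memory and compute grow quadratically with n
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # ELU+1 feature map; associativity lets us form a (d, d) matrix instead of
    # an (n, n) one, so cost is linear in the sequence length n.
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                             # (d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (n, 1) normalizer
    return (q @ kv) / (z + eps)

n, d = 1024, 64                                   # 1024 image patches, 64-dim heads
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```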

References

Showing 1-10 of 24 references

Vision Xformers: Efficient Attention for Image Classification

TLDR
This work modifies the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformer attention mechanisms of linear complexity, such as Performer, Linformer and Nyströmformer, creating Vision X-formers (ViX), and shows that all three versions of ViX may be more accurate than ViT for image classification while using far fewer parameters and computational resources.
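
As a rough illustration of the idea summarized above, the hedged sketch below shows a ViT-style encoder block that treats the attention layer as a pluggable module, so a linear-complexity mechanism (Performer, Linformer, Nyströmformer) could be dropped in. The block structure and names are assumptions, not the ViX code.

```python
# Hedged sketch: a ViT-style encoder block with a pluggable attention module,
# so a linear-complexity attention variant can replace softmax attention.
# Structure and names are assumptions.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim, attention: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attention                    # any (B, N, dim) -> (B, N, dim) mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))         # token mixing (attention variant)
        x = x + self.mlp(self.norm2(x))          # channel mixing
        return x

# Example: plug in an identity "attention" just to show the interface.
block = EncoderBlock(dim=64, attention=nn.Identity())
print(block(torch.randn(2, 196, 64)).shape)      # torch.Size([2, 196, 64])
```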

CvT: Introducing Convolutions to Vision Transformers

TLDR
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
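
The sketch below illustrates the general idea of computing attention projections with convolutions over a 2-D token grid, in the spirit of convolution-in-transformer designs; the specific layers, kernel size, and names are assumptions rather than the CvT implementation.

```python
# Hedged sketch of a convolutional q/k/v projection over a 2-D token grid.
# Layer choices, kernel size, and names are assumptions, not the CvT code.
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # Depthwise conv mixes local spatial context before the linear projection.
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm2d(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, dim, H, W) token grid
        x = self.bn(self.dw(x))
        x = x.flatten(2).transpose(1, 2)          # (B, H*W, dim) token sequence
        return self.proj(x)

q_proj = ConvProjection(dim=64)
tokens = torch.randn(2, 64, 14, 14)               # 14x14 grid of 64-dim tokens
print(q_proj(tokens).shape)                       # torch.Size([2, 196, 64])
```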

Patches Are All You Need?

TLDR
The ConvMixer is proposed, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
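
A minimal sketch of a ConvMixer-style layer as described above, assuming illustrative hyperparameters: a strided convolution produces patch embeddings, a depthwise convolution with a residual mixes spatial locations, and a pointwise convolution mixes channels.

```python
# Hedged sketch of a ConvMixer-style layer: patch embedding via a strided conv,
# then a depthwise conv (spatial mixing, with residual) followed by a pointwise
# conv (channel mixing). Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

dim, patch, kernel = 64, 4, 5

patch_embed = nn.Sequential(
    nn.Conv2d(3, dim, kernel_size=patch, stride=patch), nn.GELU(), nn.BatchNorm2d(dim)
)

class MixerLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, kernel, padding="same", groups=dim),
            nn.GELU(), nn.BatchNorm2d(dim),
        )
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1), nn.GELU(), nn.BatchNorm2d(dim)
        )

    def forward(self, x):
        x = x + self.spatial(x)                   # spatial mixing with residual
        return self.channel(x)                    # per-patch channel mixing

x = patch_embed(torch.randn(2, 3, 32, 32))        # (2, 64, 8, 8) patch grid
print(MixerLayer()(x).shape)
```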

MLP-Mixer: An all-MLP Architecture for Vision

TLDR
It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
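
A minimal sketch of an MLP-Mixer-style block, assuming illustrative dimensions: one MLP mixes information across patches (token mixing) and another mixes across channels, each preceded by layer normalization and wrapped in a residual connection.

```python
# Hedged sketch of an MLP-Mixer-style block: a token-mixing MLP applied across
# patches and a channel-mixing MLP applied per patch, each with a residual.
# Dimensions and names are illustrative.
import torch
import torch.nn as nn

def mlp(d_in, d_hidden):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=64):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = mlp(num_patches, num_patches * 2)    # mixes across patches
        self.channel_mlp = mlp(dim, dim * 4)                   # mixes across channels

    def forward(self, x):                           # x: (B, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)           # (B, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)   # token mixing + residual
        return x + self.channel_mlp(self.norm2(x))  # channel mixing + residual

print(MixerBlock()(torch.randn(2, 196, 64)).shape)
```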

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

TLDR
This work designs a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime and proposes LeViT, a hybrid neural network for fast inference image classification that significantly outperforms existing convnets and vision transformers.

Escaping the Big Data Paradigm with Compact Transformers

TLDR
This paper shows for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets, and presents an approach for small-scale learning by introducing Compact Transformers.
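
The sketch below illustrates convolutional tokenization in general terms: a small convolutional stem with pooling produces a feature map that is flattened into a token sequence for a transformer. The layer choices are assumptions, not the Compact Transformer code.

```python
# Hedged sketch of convolutional tokenization: a conv stem followed by pooling,
# with the resulting feature map flattened into a token sequence.
# Layer choices are assumptions, not the Compact Transformer code.
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.stem(x)                           # (B, dim, H/2, W/2)
        return x.flatten(2).transpose(1, 2)        # (B, num_tokens, dim)

tokens = ConvTokenizer()(torch.randn(2, 3, 32, 32))
print(tokens.shape)                                # torch.Size([2, 256, 64])
```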

Tiny ImageNet Visual Recognition Challenge

TLDR
This work investigates the effect of convolutional network depth, receptive field size, dropout layers, rectified activation unit type and dataset noise on classification accuracy in the Tiny-ImageNet Challenge setting, and achieves excellent performance even compared to state-of-the-art results.

Deep Residual Learning for Image Recognition

TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
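
A minimal sketch of a basic residual block, with illustrative shapes and without the projection shortcut used when dimensions change: the block's output is added to its input, so the layers only need to learn a residual function.

```python
# Hedged sketch of a basic residual block: two 3x3 convolutions whose output is
# added to the input (identity shortcut), so the block learns a residual function.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                 # identity shortcut eases optimization

print(BasicBlock(64)(torch.randn(2, 64, 8, 8)).shape)
```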

Learning Multiple Layers of Features from Tiny Images

TLDR
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Rethinking Attention with Performers

TLDR
Performers are introduced: Transformer architectures that can estimate regular (softmax) full-rank attention with provable accuracy, but using only linear space and time complexity, without relying on any priors such as sparsity or low-rankness.
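
The sketch below conveys the flavor of the approach with a simplified positive-random-features approximation of softmax attention; the orthogonal-feature construction and numerical stabilization of FAVOR+ are omitted, and all names are illustrative.

```python
# Hedged sketch of attention via positive random features, in the spirit of the
# Performer's softmax-kernel approximation; exact FAVOR+ details are omitted,
# and all names are illustrative.
import torch

def random_feature_map(x, w):
    # phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), with rows of w drawn from N(0, I)
    m = w.shape[0]
    return torch.exp(x @ w.t() - (x ** 2).sum(-1, keepdim=True) / 2) / m ** 0.5

def performer_style_attention(q, k, v, num_features=256, eps=1e-6):
    d = q.shape[-1]
    w = torch.randn(num_features, d)
    q_prime = random_feature_map(q / d ** 0.25, w)      # (n, m)
    k_prime = random_feature_map(k / d ** 0.25, w)      # (n, m)
    kv = k_prime.t() @ v                                # (m, d): linear in n
    z = q_prime @ k_prime.sum(0, keepdim=True).t()      # (n, 1) normalizer
    return (q_prime @ kv) / (z + eps)

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
approx = performer_style_attention(q, k, v)
exact = torch.softmax(q @ k.t() / d ** 0.5, dim=-1) @ v
print(approx.shape, torch.mean((approx - exact) ** 2).item())
```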