Corpus ID: 219531480

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

@article{Wu2020VisualTT,
  title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
  author={B. Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and M. Tomizuka and K. Keutzer and P{\'e}ter Vajda},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.03677}
}
Computer vision has achieved great success using standardized image representations (pixel arrays) and the corresponding deep learning operators (convolutions). In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation for high-level semantics…
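The two steps named in the abstract, (a) turning an image into a small set of visual tokens and (b) letting a transformer relate those tokens, can be illustrated with a minimal pure-Python sketch. It assumes a spatial-attention pooling tokenizer: a learned matrix `w_a` (C x L) scores every pixel for each of L tokens, a softmax over the spatial axis turns the scores into weights, and each token is the weighted average of pixel features. This is an assumed form, not necessarily the paper's exact tokenizer. The attention step is a single-head scaled dot-product attention with identity query/key/value projections; multi-head attention, normalization, feed-forward layers, and any projection back to pixels are omitted. All names (`tokenize`, `self_attention`, `w_a`) are hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def tokenize(pixels, w_a):
    """pixels: HW x C flattened feature map; w_a: C x L (hypothetical learned weights).
    Returns L x C tokens, each a convex combination of pixel features."""
    logits = matmul(pixels, w_a)                # HW x L: score of each pixel for each token
    attn = [softmax(col) for col in transpose(logits)]  # L x HW, softmax over the spatial axis
    return matmul(attn, pixels)                 # L x C


def self_attention(tokens):
    """Single-head scaled dot-product attention among tokens (identity projections)."""
    d = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))  # L x L pairwise token similarities
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)              # L x C updated tokens


# Usage: a 4x4 feature map with 3 channels compressed into 2 tokens.
random.seed(0)
H, W, C, L = 4, 4, 3, 2
pixels = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]
w_a = [[random.gauss(0, 1) for _ in range(L)] for _ in range(C)]
tokens = tokenize(pixels, w_a)
out = self_attention(tokens)
print(len(tokens), len(tokens[0]))  # 2 3
```

The design point the sketch is meant to show: attention in step (b) runs over L tokens rather than H x W pixels, so its pairwise cost scales with L squared instead of (HW) squared.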
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Multiscale Vision Transformers
AAformer: Auto-Aligned Transformer for Person Re-Identification
LambdaNetworks: Modeling Long-Range Interactions Without Attention

References

Showing 8 of 53 references
Local Relation Networks for Image Recognition
Image Transformer
Attention to Scale: Scale-Aware Semantic Image Segmentation
Attention Augmented Convolutional Networks
A2-Nets: Double Attention Networks
Pixel-Adaptive Convolutional Neural Networks
LatentGNN: Learning Efficient Non-local Relations for Visual Recognition
Deep Residual Learning for Image Recognition