Corpus ID: 232352612

Vision Transformers for Dense Prediction

@article{Ranftl2021VisionTF,
  title={Vision Transformers for Dense Prediction},
  author={Ren{\'e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.13413}
}
We introduce dense prediction transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global…
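The core step described in the abstract, reassembling transformer tokens into an image-like feature map that a convolutional decoder can consume, can be illustrated with a short sketch. This is a minimal illustration under assumptions of our own (module name, patch-grid size, 1x1 projection, and bilinear resampling), not the authors' implementation.

```python
import torch
import torch.nn as nn

class Reassemble(nn.Module):
    """Turn ViT tokens from one stage into an image-like feature map.

    Minimal sketch: drop the class token, reshape the patch tokens to a 2D
    grid, project with a 1x1 convolution, and resample to the target
    resolution.
    """
    def __init__(self, embed_dim=768, out_channels=256, scale=4):
        super().__init__()
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)
        # Upsample (scale > 1) or downsample to the desired resolution.
        self.resample = nn.Upsample(scale_factor=scale, mode="bilinear",
                                    align_corners=False)

    def forward(self, tokens, grid_hw):
        # tokens: (B, 1 + H*W, C) with a leading class token
        b, _, c = tokens.shape
        h, w = grid_hw
        x = tokens[:, 1:, :].transpose(1, 2).reshape(b, c, h, w)
        return self.resample(self.project(x))

# Example: tokens from a ViT stage for a 384x384 input with 16x16 patches
tokens = torch.randn(2, 1 + 24 * 24, 768)
feature_map = Reassemble()(tokens, (24, 24))
print(feature_map.shape)  # torch.Size([2, 256, 96, 96])
```

Maps reassembled at several scales in this way can then be fused by an ordinary convolutional decoder to produce the full-resolution prediction.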
Citations

Fully Transformer Networks for Semantic Image Segmentation
This work proposes a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features while reducing the computational complexity of the standard vision transformer (ViT), achieving new state-of-the-art results on multiple challenging semantic segmentation benchmarks.
ConvNets vs. Transformers: Whose Visual Representations are More Transferable?
  • Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu
  • Computer Science
  • ArXiv
  • 2021
This work systematically investigates the transfer learning ability of ConvNets and vision transformers in single-task and multi-task performance evaluations and finds that two ViT models rely heavily on whole-network fine-tuning to achieve performance gains, while Swin Transformer does not have such a requirement.
XCiT: Cross-Covariance Image Transformers
This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries; it has linear complexity in the number of tokens and allows efficient processing of high-resolution images.
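The channel-wise attention described above can be sketched in a few lines. This is a minimal sketch, assuming a (batch, channels, tokens) layout and a fixed temperature; the per-head learnable temperatures and local patch interaction blocks of the full model are omitted.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    """Attention across feature channels rather than tokens.

    q, k, v: (batch, channels, tokens). The (channels x channels) attention
    map is built from L2-normalized queries and keys, so the cost is linear
    in the number of tokens.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature   # (batch, C, C)
    attn = attn.softmax(dim=-1)
    return attn @ v                                  # (batch, C, tokens)

x = torch.randn(2, 64, 196)            # 196 patch tokens, 64 channels
out = cross_covariance_attention(x, x, x)
print(out.shape)                       # torch.Size([2, 64, 196])
```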
Scaled ReLU Matters for Training Vision Transformers
  • Pichao Wang, Xue Wang, +5 authors Rong Jin
  • Computer Science
  • ArXiv
  • 2021
It is verified, both theoretically and empirically, that scaled ReLU in the conv-stem matters for robust ViT training: it not only improves training stability but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs.
Multi-Exit Vision Transformer for Dynamic Inference
This work proposes seven different architectures for early-exit branches that can be used for dynamic inference in Vision Transformer backbones and shows that each of these architectures can prove useful in the trade-off between accuracy and speed.
Progress and Proposals: A Case Study of Monocular Depth Estimation
Deep learning has achieved great results and made rapid progress over the past few years, particularly in the field of computer vision. Deep learning models are composed of artificial neural networks…
Uformer: A General U-Shaped Transformer for Image Restoration
Uformer is presented, an effective and efficient Transformer-based architecture in which a hierarchical encoder-decoder network is built using the Transformer block for image restoration; three skip-connection schemes are explored to effectively deliver information from the encoder to the decoder.
ViTGAN: Training GANs with Vision Transformers
This paper integrates the ViT architecture into generative adversarial networks (GANs) and introduces novel regularization techniques for training GANs with ViTs, achieving performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
Is it Time to Replace CNNs with Transformers for Medical Images?
While CNNs perform better when trained from scratch, off-the-shelf vision transformers using default hyperparameters are on par with CNNs when pretrained on ImageNet, and outperform their CNN counterparts when pretrained using self-supervision.
Vision Transformer Hashing for Image Retrieval
  • S. Dubey, Satish Kumar Singh, Wei-Ta Chu
  • Computer Science
  • ArXiv
  • 2021
The proposed VTS-based image retrieval outperforms recent state-of-the-art hashing techniques by a large margin and is better than existing networks such as AlexNet and ResNet.

References

Showing 1-10 of 61 references
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
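The distillation-token idea lends itself to a short sketch of the hard-label variant of the loss: the class token is supervised by the ground truth while the distillation token is supervised by the teacher's hard predictions. The function name and the equal weighting below are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Sketch of hard-label distillation with a distillation token.

    cls_logits:     student predictions from the class token
    dist_logits:    student predictions from the distillation token
    teacher_logits: predictions of a (typically convolutional) teacher
    """
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_cls = F.cross_entropy(cls_logits, labels)          # ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # teacher
    return 0.5 * (loss_cls + loss_dist)

cls_logits = torch.randn(8, 1000)
dist_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels))
```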
Fully Convolutional Networks for Semantic Segmentation
It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
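A toy example of the fully convolutional idea: every layer is a convolution or pooling operation, so the network produces a per-pixel score map for inputs of arbitrary size. This is a deliberately tiny sketch with bilinear upsampling, not the original FCN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Minimal fully convolutional sketch: no fully connected layers, so the
    model maps an image of any size to a dense per-pixel class score map."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        scores = self.classifier(self.features(x))   # coarse score map
        return F.interpolate(scores, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)

x = torch.randn(1, 3, 224, 224)
print(TinyFCN()(x).shape)  # torch.Size([1, 21, 224, 224])
```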
Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice while maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections, and introduces chained residual pooling, which captures rich background context in an efficient manner.
Attention is All you Need
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
This work addresses the task of semantic image segmentation with deep learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Attention Augmented Convolutional Networks
It is found that attention augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Quantitative assessments show that SegNet provides good performance with competitive inference time and is the most memory-efficient at inference compared to other architectures, including FCN and DeconvNet.
Deep High-Resolution Representation Learning for Visual Recognition
The superiority of the proposed HRNet is shown in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems.