Corpus ID: 232352612

Vision Transformers for Dense Prediction

@article{Ranftl2021VisionTF,
  title={Vision Transformers for Dense Prediction},
  author={Ren{\'e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.13413}
}
We introduce dense prediction transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global…
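To make the token-reassembly idea from the abstract concrete, here is a minimal sketch (my illustration, not the authors' implementation): it assumes a plain ViT with 16x16 patches whose first token is a class/readout token, and reshapes the remaining tokens into a 2D feature map that a convolutional decoder can consume.

```python
import torch

def reassemble_tokens(tokens, image_hw, patch_size=16):
    """Reshape ViT tokens (batch, 1 + N, dim) into an image-like map.

    Assumes the first token is a class/readout token and the remaining
    N = (H/ps) * (W/ps) tokens follow row-major patch order.
    """
    h, w = image_hw[0] // patch_size, image_hw[1] // patch_size
    x = tokens[:, 1:, :]                       # drop the readout token
    b, n, c = x.shape
    assert n == h * w, "token count must match the patch grid"
    return x.transpose(1, 2).reshape(b, c, h, w)  # (batch, dim, H/ps, W/ps)

# e.g. ViT-Base tokens over a 384x384 image: (1, 1 + 576, 768) -> (1, 768, 24, 24)
feature_map = reassemble_tokens(torch.randn(1, 577, 768), (384, 384))
```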
Citations

Fully Transformer Networks for Semantic Image Segmentation
TLDR: This work proposes a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features while reducing the computation complexity of the standard Vision Transformer (ViT), and achieves new state-of-the-art results on multiple challenging semantic segmentation benchmarks.
HRFormer: High-Resolution Transformer for Dense Prediction
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations…
ConvNets vs. Transformers: Whose Visual Representations are More Transferable?
  • Hong-Yu Zhou, Chi-Ken Lu, Sibei Yang, Yizhou Yu
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
  • 2021
TLDR: This work systematically investigates the transfer learning ability of ConvNets and vision transformers in single-task and multi-task performance evaluations, and finds that two ViT models rely heavily on whole-network fine-tuning to achieve performance gains while Swin Transformer does not.
XCiT: Cross-Covariance Image Transformers
TLDR: This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries; it has linear complexity in the number of tokens and allows efficient processing of high-resolution images.
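A minimal sketch of this "transposed" attention (my paraphrase of the idea, not the reference implementation): queries and keys are L2-normalized along the token dimension, so the softmax runs over a channels-by-channels cross-covariance matrix and the cost grows linearly with the number of tokens.

```python
import torch
import torch.nn.functional as F

def cross_covariance_attention(q, k, v, temperature=1.0):
    """q, k, v: (batch, channels, tokens); attention map is channels x channels."""
    q = F.normalize(q, dim=-1)                       # normalize along tokens
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature   # (batch, d, d) cross-covariance
    attn = attn.softmax(dim=-1)
    return attn @ v                                  # (batch, channels, tokens)

# cost is O(d^2 * N): linear in token count N, enabling high-resolution inputs
out = cross_covariance_attention(*torch.randn(3, 2, 64, 4096))
```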
Scaled ReLU Matters for Training Vision Transformers
  • Pichao Wang, Xue Wang, +5 authors Rong Jin
  • Computer Science
  • ArXiv
  • 2021
TLDR: It is verified, both theoretically and empirically, that scaled ReLU in the conv-stem matters for robust ViT training: it not only improves training stabilization but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs.
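The paper's exact "scaled ReLU" formulation is not reproduced here; the sketch below shows one plausible reading (an assumption on my part): a convolutional stem of conv-BatchNorm-ReLU stages, where the BatchNorm's learnable affine parameters provide the scaling around each ReLU before patch tokens are formed.

```python
import torch
import torch.nn as nn

# Hypothetical conv-stem: each stage is conv -> BatchNorm -> ReLU, with the
# BatchNorm affine parameters rescaling activations around the ReLU
# (one plausible reading; the paper may define "scaled ReLU" differently).
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(48),
    nn.ReLU(inplace=True),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(96),
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 192, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(192),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1),  # total stride 16
)

tokens = conv_stem(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)
print(tokens.shape)  # (1, 196, 384): same token grid as a 16x16 patch embedding
```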
Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction
  • Jing Zhang, Jianwen Xie, N. Barnes, Ping Li
  • 2021
Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables…
Progress and Proposals: A Case Study of Monocular Depth Estimation
Deep learning has achieved great results and made rapid progress over the past few years, particularly in the field of computer vision. Deep learning models are composed of artificial neural networks…
Uformer: A General U-Shaped Transformer for Image Restoration
TLDR: This work presents Uformer, an effective and efficient Transformer-based architecture in which a hierarchical encoder-decoder network is built using the Transformer block for image restoration, and explores three skip-connection schemes to effectively deliver information from the encoder to the decoder.
ViTGAN: Training GANs with Vision Transformers
TLDR: This paper integrates the ViT architecture into generative adversarial networks (GANs) and introduces novel regularization techniques for training GANs with ViTs, achieving performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
Is it Time to Replace CNNs with Transformers for Medical Images?
TLDR: While CNNs perform better when trained from scratch, off-the-shelf vision transformers using default hyperparameters are on par with CNNs when pretrained on ImageNet, and outperform their CNN counterparts when pretrained using self-supervision.

References

Showing 1-10 of 61 references
Training data-efficient image transformers & distillation through attention
TLDR: This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
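A minimal sketch of the hard-distillation objective implied by the distillation token (assuming the student emits separate logits from its class token and its distillation token; the function and argument names are mine):

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Class token supervised by ground truth; distillation token supervised
    by the teacher's hard prediction, averaged with equal weight."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(cls_logits, labels) \
         + 0.5 * F.cross_entropy(dist_logits, teacher_labels)
```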
Image Transformer
TLDR: This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, significantly increasing the size of images the model can process in practice while maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
Fully Convolutional Networks for Semantic Segmentation
TLDR: It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
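A toy illustration of the pixels-to-pixels idea (a sketch, not the paper's FCN-8s architecture): a 1x1 convolution turns coarse backbone features into per-class score maps, and a learned transposed convolution upsamples them back to input resolution.

```python
import torch.nn as nn

class TinyFCNHead(nn.Module):
    """1x1 conv scores each coarse feature location per class; a transposed
    conv upsamples the score map by the backbone's total stride."""
    def __init__(self, in_ch, num_classes, stride=32):
        super().__init__()
        self.score = nn.Conv2d(in_ch, num_classes, 1)
        self.up = nn.ConvTranspose2d(num_classes, num_classes,
                                     kernel_size=2 * stride, stride=stride,
                                     padding=stride // 2, bias=False)

    def forward(self, x):
        # e.g. (B, in_ch, H/32, W/32) -> (B, num_classes, H, W)
        return self.up(self.score(x))
```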
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
TLDR: This work presents RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections, and introduces chained residual pooling, which captures rich background context in an efficient manner.
Attention is All you Need
TLDR: This work proposes the Transformer, a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
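The paper's core primitive, scaled dot-product attention, fits in a few lines (a bare-bones sketch without multi-head projections):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V over the last two dimensions."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return scores.softmax(dim=-1) @ v
```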
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
TLDR: This work addresses the task of semantic image segmentation with deep learning, proposing atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales and improving the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
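A minimal ASPP sketch in the DeepLab-v2 style, where parallel dilated 3x3 branches sample the feature map at several effective receptive-field sizes and are summed (the channel sizes and rates below are placeholders):

```python
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates; branch
    outputs are summed, as in the DeepLab-v2 formulation."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        # padding == dilation keeps every branch at the input resolution
        return sum(branch(x) for branch in self.branches)
```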
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention Augmented Convolutional Networks
TLDR: It is found that attention augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile-constrained network, while keeping the number of parameters similar.
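The augmentation itself is easy to sketch: convolutional features and self-attention features computed on the same input are concatenated along the channel dimension. This is a simplified single-head version; the paper also uses multiple heads and relative position encodings, omitted here.

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv(nn.Module):
    """Concatenate a conv branch with a single-head self-attention branch
    (simplified: one head, no relative position encodings)."""
    def __init__(self, in_ch, conv_ch, attn_ch, d_k=32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, conv_ch, 3, padding=1)
        self.qkv = nn.Conv2d(in_ch, 2 * d_k + attn_ch, 1)
        self.d_k, self.attn_ch = d_k, attn_ch

    def forward(self, x):
        b, _, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).split(
            [self.d_k, self.d_k, self.attn_ch], dim=1)
        attn = (q.transpose(1, 2) @ k / self.d_k ** 0.5).softmax(dim=-1)
        attn_out = (attn @ v.transpose(1, 2)).transpose(1, 2)
        attn_out = attn_out.reshape(b, self.attn_ch, h, w)
        # channel-wise concatenation of the two branches
        return torch.cat([self.conv(x), attn_out], dim=1)
```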
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
TLDR: Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.
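SegNet's memory efficiency comes from its decoder reusing the encoder's max-pooling indices for upsampling instead of storing full encoder feature maps; PyTorch exposes exactly this pooling/unpooling pairing, shown in a small sketch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)           # encoder keeps only the argmax locations
restored = unpool(pooled, indices)  # decoder places values back at those locations
print(restored.shape)               # (1, 64, 32, 32)
```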
Deep High-Resolution Representation Learning for Visual Recognition
TLDR: The superiority of the proposed HRNet is shown in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that HRNet is a stronger backbone for computer vision problems.