Corpus ID: 234470051

Segmenter: Transformer for Semantic Segmentation

@article{Strudel2021SegmenterTF,
  title={Segmenter: Transformer for Semantic Segmentation},
  author={Robin A. M. Strudel and Ricardo Garcia and I. Laptev and C. Schmid},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.05633}
}
Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based methods, our approach allows modeling global context already at the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings…
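The pipeline the abstract outlines (patch embedding, global self-attention from the first layer, a classifier over the output embeddings) can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the weights are random, a single attention layer stands in for the full ViT encoder, and the per-patch linear classifier corresponds only loosely to the paper's simplest decoder variant.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segmenter_linear(img, patch=4, d_model=16, n_classes=3, seed=0):
    """Toy ViT-style encoder + linear per-patch classifier (random weights)."""
    rng = np.random.default_rng(seed)
    H, W = img.shape
    # 1) Split the image into non-overlapping patches and flatten each one.
    ph, pw = H // patch, W // patch
    patches = img.reshape(ph, patch, pw, patch).transpose(0, 2, 1, 3)
    tokens = patches.reshape(ph * pw, patch * patch)          # (N, P^2)
    # 2) Linear patch embedding.
    W_embed = rng.standard_normal((patch * patch, d_model)) * 0.1
    x = tokens @ W_embed                                      # (N, d)
    # 3) One global self-attention layer: every patch attends to every
    #    patch, so context is global from the first layer on, unlike
    #    the local receptive fields of convolutions.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))
    x = x + attn @ v                                          # residual
    # 4) Linear decoder: per-patch class logits, upsampled to pixel labels.
    W_cls = rng.standard_normal((d_model, n_classes)) * 0.1
    labels = (x @ W_cls).reshape(ph, pw, n_classes).argmax(-1)
    return np.kron(labels, np.ones((patch, patch), dtype=int))  # (H, W)

mask = segmenter_linear(np.random.rand(16, 16).astype(np.float32))
print(mask.shape)  # (16, 16)
```

With trained weights, step 4 is where per-patch predictions become a dense per-pixel label map; the paper also describes a mask transformer decoder not shown here.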
Per-Pixel Classification is Not All You Need for Semantic Segmentation
TLDR
The proposed MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction, simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results.
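The mask-classification idea in this TLDR can be sketched as follows (toy random logits, not the actual MaskFormer model): each of N predicted binary masks carries its own class distribution, and per-pixel semantic scores are obtained by marginalizing the class probabilities over the masks.

```python
import numpy as np

# Toy mask-classification inference: N mask proposals, each with a class
# distribution; per-pixel semantic scores marginalize over the proposals.
rng = np.random.default_rng(0)
N, C, H, W = 5, 3, 8, 8                  # proposals, classes, height, width
mask_logits = rng.standard_normal((N, H, W))
class_logits = rng.standard_normal((N, C))

mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))    # sigmoid: binary masks
class_probs = np.exp(class_logits)
class_probs /= class_probs.sum(-1, keepdims=True)  # softmax over classes

# score[c, h, w] = sum_i class_probs[i, c] * mask_probs[i, h, w]
scores = np.einsum('ic,ihw->chw', class_probs, mask_probs)
semantic = scores.argmax(0)              # (H, W) semantic label map
```

The same set of masks can instead be filtered and overlaid for panoptic output, which is why a single mask-classification head covers both tasks.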
Efficient Training of Visual Transformers with Small-Size Datasets
TLDR
This paper empirically analyses different VTs, comparing their robustness in a small training-set regime, and proposes a self-supervised task which can extract additional information from images with only negligible computational overhead and can improve (sometimes dramatically) the final accuracy of the VTs.
VOLO: Vision Outlooker for Visual Recognition
TLDR
A novel outlook attention, termed Vision Outlooker (VOLO), is introduced; it efficiently encodes finer-level features and contexts into tokens, which is shown to be critically beneficial to recognition performance but largely ignored by self-attention.
Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance
Transparent objects, such as glass walls and doors, constitute architectural obstacles hindering the mobility of people with low vision or blindness. For instance, the open space behind glass doors…
Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation
Histopathology has played an essential role in cancer diagnosis. With the rapid advances in convolutional neural networks (CNNs), various CNN-based automated pathological image segmentation approaches…
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
TLDR
A Cross-Shaped Window self-attention mechanism is developed that computes self-attention in horizontal and vertical stripes in parallel, which together form a cross-shaped window; each stripe is obtained by splitting the input feature into stripes of equal width.
A Survey on Vision Transformer
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation…
Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World
TLDR
A wearable system with a novel dual-head Transformer for Transparency (Trans4Trans) model, capable of segmenting general and transparent objects and performing real-time wayfinding to assist people walking alone more safely.
ViTGAN: Training GANs with Vision Transformers
TLDR
This paper integrates the ViT architecture into generative adversarial networks (GANs) and introduces novel regularization techniques for training GANs with ViTs, achieving comparable performance to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
Focal Self-attention for Local-Global Interactions in Vision Transformers
TLDR
A new variant of Vision Transformer models, called Focal Transformer, is proposed, which achieves superior performance over state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks.

References

Showing 1–10 of 69 references
Fully Convolutional Networks for Semantic Segmentation
TLDR
It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
TLDR
This work addresses the task of semantic image segmentation with Deep Learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
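Atrous (dilated) convolution, the building block behind ASPP, samples the input with gaps so the receptive field grows without adding parameters; ASPP simply applies several such rates in parallel and fuses the results. A minimal single-channel sketch (an illustrative helper, not DeepLab's implementation):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """2-D atrous (dilated) convolution with 'valid' padding (toy version).
    A rate-r kernel reads the input with gaps of r-1 pixels, so a 3x3
    kernel at rate 2 covers a 5x5 region with the same 9 weights."""
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1   # effective extent
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice implements the dilation.
            out[i, j] = (x[i:i + eh:rate, j:j + ew:rate] * kernel).sum()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))
print(atrous_conv2d(x, k, rate=1).shape)  # (5, 5)
print(atrous_conv2d(x, k, rate=2).shape)  # (3, 3): 5x5 receptive field
```

Running the same input through, say, rates 1, 2, and 4 and concatenating the outputs is the essence of the ASPP module.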
Context Prior for Scene Segmentation
TLDR
This work develops a Context Prior, supervised by an Affinity Loss, and an effective Context Prior Network that can selectively capture intra-class and inter-class contextual dependencies, leading to robust feature representation.
Rethinking Atrous Convolution for Semantic Image Segmentation
TLDR
The proposed `DeepLabv3' system significantly improves over the previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
Recurrent Convolutional Neural Networks for Scene Labeling
TLDR
This work proposes an approach that consists of a recurrent convolutional neural network which allows us to consider a large input context while limiting the capacity of the model, and yields state-of-the-art performance on both the Stanford Background Dataset and the SIFT Flow Dataset while remaining very fast at test time.
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
TLDR
This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF).
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
TLDR
This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries, and applies the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes
TLDR
This work proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance, and combines multi-scale context with pixel-level accuracy by using two processing streams within the network.
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
TLDR
RefineNet is presented: a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. It also introduces chained residual pooling, which captures rich background context in an efficient manner.
Gated Feedback Refinement Network for Dense Image Labeling
TLDR
This paper proposes the Gated Feedback Refinement Network (G-FRNet), an end-to-end deep learning framework for dense labeling tasks that addresses limitations of existing methods by introducing gate units that control the information passed forward in order to filter out ambiguity.