Masked-attention Mask Transformer for Universal Image Segmentation

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. CVPR 2022.
Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts…
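The core idea of masked attention — restricting each query's cross-attention to the foreground region of that query's own predicted mask — can be sketched in a few lines. This is a toy sketch with illustrative names, not the paper's implementation, which applies the mechanism inside a transformer decoder over multi-scale feature maps:

```python
import math

def masked_attention(scores, mask, neg_inf=-1e9):
    """Softmax over attention logits, with positions outside the
    query's predicted mask pushed to ~-inf so they get ~zero weight.

    scores: raw attention logits, one per pixel location
    mask:   0/1 flags, 1 = pixel lies inside the query's mask
    """
    masked = [s if m == 1 else neg_inf for s, m in zip(scores, mask)]
    mx = max(masked)                       # stabilize the softmax
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# The middle pixel is outside the mask, so its weight collapses to ~0
# and the attention is distributed only over the in-mask pixels.
weights = masked_attention([2.0, 1.0, 3.0], [1, 0, 1])
```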

Pyramid Fusion Transformer for Semantic Segmentation

This study finds that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks, and proposes the transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation on top of multi-scale features.

Open-Vocabulary Panoptic Segmentation with MaskCLIP

A Relative Mask Attention (RMA) module is designed to feed segmentations as additional tokens into the ViT CLIP model, performing both semantic segmentation and object instance segmentation for open-vocabulary panoptic segmentation.

Clustering as Attention: Unified Image Segmentation with Hierarchical Clustering

This work proposes a hierarchical clustering-based image segmentation scheme for deep neural networks, called HCFormer, which removes the pixel decoder before the segmentation head and simplifies the segmentation pipeline, resulting in improved segmentation accuracy and interpretability.

Differentiable Soft-Masked Attention

This paper proposes another specialization of attention which enables attending over ‘soft-masks’, and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision.

RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation

This paper proposes to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification, which can be used to improve various existing segmentation frameworks.
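The two-stage decomposition can be illustrated with a small sketch (plain Python with hypothetical names and raw score lists; the paper uses learned multi-label classification and segmentation branches):

```python
def selected_label_segmentation(image_scores, pixel_logits, k=2):
    """(i) Image-level multi-label step: keep the top-k classes for
    the whole image. (ii) Pixel-level step: classify each pixel only
    among the selected classes."""
    selected = sorted(range(len(image_scores)),
                      key=lambda c: image_scores[c], reverse=True)[:k]
    return [max(selected, key=lambda c: logits[c])
            for logits in pixel_logits]

# Class 0 is rejected at the image level, so no pixel can predict it,
# even where its per-pixel logit is the highest.
labels = selected_label_segmentation(
    image_scores=[0.1, 0.9, 0.8],
    pixel_logits=[[5.0, 1.0, 2.0], [0.0, 3.0, 1.0]])
```

Restricting the per-pixel decision to image-level plausible labels is what lets the same idea bolt onto existing segmentation frameworks as a lightweight extra branch.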

k-means Mask Transformer

The relationship between pixels and object queries is rethought, and a k-means clustering algorithm is proposed to reformulate cross-attention learning as a clustering process, which not only improves the state of the art but also enjoys a simple and elegant design.
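The clustering view replaces the softmax over pixels in cross-attention with a hard argmax assignment of pixels to queries, followed by a query (center) update — essentially one k-means iteration. A minimal sketch of that single step (plain Python, illustrative names; the paper does this with learned features inside a transformer decoder):

```python
def kmeans_cross_attention_step(pixel_feats, cluster_centers):
    """One k-means-style update: assign each pixel to its most
    similar cluster center (hard argmax, in place of softmax
    attention weights), then recompute each center as the mean
    of its assigned pixel features."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Hard assignment: argmax similarity instead of soft weights.
    assign = [max(range(len(cluster_centers)),
                  key=lambda k: dot(f, cluster_centers[k]))
              for f in pixel_feats]

    # Center update: mean of the assigned pixel features.
    new_centers = []
    for k, c in enumerate(cluster_centers):
        members = [f for f, a in zip(pixel_feats, assign) if a == k]
        if members:
            dim = len(members[0])
            c = [sum(m[d] for m in members) / len(members)
                 for d in range(dim)]
        new_centers.append(c)
    return assign, new_centers
```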

MLSeg: Image and Video Segmentation as Multi-Label Classification and Selected-Label Pixel Classification

This paper proposes to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level selected-label classification, which is conceptually general and can be applied to various existing segmentation frameworks by simply adding a lightweight multi-label classification branch.

Mask2Former for Video Instance Segmentation

It is found Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline, and is also capable of handling video semantic and panoptic segmentation.

Visual Attention Network

A novel linear attention named large kernel attention (LKA) is proposed to enable self-adaptive and long-range correlations, as in self-attention, while avoiding its shortcomings, and a neural network based on LKA, namely the Visual Attention Network (VAN), is presented.

StructToken: Rethinking Semantic Segmentation with Structural Prior

This work presents a new paradigm for semantic segmentation, named structure-aware extraction, which aims to progressively extract the structural information of each category from the image feature.

Panoptic Feature Pyramid Networks

This work endows Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone, and shows it is a robust and accurate baseline for both tasks.

K-Net: Towards Unified Image Segmentation

Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results for panoptic segmentation on the MS COCO test-dev split and for semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively.

Segmenter: Transformer for Semantic Segmentation

This paper introduces Segmenter, a transformer model for semantic segmentation that outperforms the state of the art on both ADE20K and Pascal Context datasets and is competitive on Cityscapes.

Panoptic Segmentation

A novel panoptic quality (PQ) metric is proposed that captures performance for all classes (stuff and things) in an interpretable and unified manner, and a rigorous study of both human and machine performance for PS on three existing datasets is performed, revealing interesting insights about the task.
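The metric itself is compact: matched (TP) segment pairs are those with IoU above 0.5 — a threshold that makes the matching unique — and PQ divides the summed matched IoUs by the count of TPs plus half the unmatched predictions (FP) and unmatched ground-truth segments (FN). A sketch with hypothetical inputs (real evaluations compute IoUs from the panoptic label maps and handle per-class averaging and ignore regions):

```python
def panoptic_quality(pair_ious, unmatched_pred, unmatched_gt, thresh=0.5):
    """PQ = sum(IoU over TP) / (|TP| + 0.5*|FP| + 0.5*|FN|).

    pair_ious:      IoUs of candidate predicted/ground-truth pairs;
                    pairs at or below `thresh` count as one FP and
                    one FN each
    unmatched_pred: predicted segments with no candidate match (FP)
    unmatched_gt:   ground-truth segments with no candidate match (FN)
    """
    tp = [i for i in pair_ious if i > thresh]
    rejected = len(pair_ious) - len(tp)
    fp = unmatched_pred + rejected
    fn = unmatched_gt + rejected
    denom = len(tp) + 0.5 * fp + 0.5 * fn
    return sum(tp) / denom if denom else 0.0

# Two true matches (0.8, 0.6), one rejected pair (0.3), one stray
# prediction: PQ = (0.8 + 0.6) / (2 + 0.5*2 + 0.5*1) = 1.4 / 3.5.
pq = panoptic_quality([0.8, 0.6, 0.3], unmatched_pred=1, unmatched_gt=0)
```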

Dual Attention Network for Scene Segmentation

New state-of-the-art segmentation performance is achieved on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context, and COCO Stuff, without using coarse data.

Rethinking Atrous Convolution for Semantic Image Segmentation

The proposed DeepLabv3 system significantly improves over previous DeepLab versions without DenseCRF post-processing and attains performance comparable with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

This work addresses the task of semantic image segmentation with deep learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
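The mechanism behind atrous (dilated) convolution and ASPP is easiest to see in 1-D: the kernel taps are spaced `rate` samples apart, enlarging the receptive field without adding parameters, and ASPP runs several rates in parallel. A toy sketch (the paper uses 2-D convolutions inside a CNN, with rates such as 6, 12, and 18):

```python
def dilated_conv1d(x, kernel, rate):
    """1-D convolution 'with holes': kernel taps are spaced `rate`
    samples apart (rate=1 is an ordinary convolution)."""
    k = len(kernel)
    span = (k - 1) * rate          # receptive field minus one
    return [sum(kernel[j] * x[i + j * rate] for j in range(k))
            for i in range(len(x) - span)]

def aspp_1d(x, kernel, rates=(1, 2, 4)):
    """ASPP idea in 1-D: apply the same kernel at several dilation
    rates in parallel, capturing context at multiple scales."""
    return [dilated_conv1d(x, kernel, r) for r in rates]

# rate=1 sees 3 consecutive samples; rate=4 spans 9 samples with the
# same 3-tap kernel.
branches = aspp_1d(list(range(1, 10)), [1, 1, 1])
```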

Conditional Convolutions for Instance Segmentation

A simpler instance segmentation method is proposed that achieves improved performance in both accuracy and inference speed on the COCO dataset, and outperforms several recent methods, including well-tuned Mask R-CNN baselines, without needing longer training schedules.

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

This paper deploys a pure transformer to encode an image as a sequence of patches, termed SEgmentation TRansformer (SETR), and shows that SETR achieves a new state of the art on ADE20K and Pascal Context, and competitive results on Cityscapes.