CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation

  Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell D. Collins, Yukun Zhu, Hartwig Adam, Alan Loddon Yuille, Liang-Chieh Chen
In the supplementary materials, we provide more technical details, along with more ablation and comparison results with other concurrent works. We also include more visualizations and comparisons over the baselines. Additionally, we provide a comprehensive comparison, in terms of training epochs, memory cost, parameters, FLOPs, and FPS, across different methods. We also report results with a ResNet-50 backbone for a fair comparison across different methods, along with additional results on…

k-means Mask Transformer

The relationship between pixels and object queries is rethought, and a k-means clustering algorithm is proposed to reformulate cross-attention learning as a clustering process, which not only improves the state of the art but also enjoys a simple and elegant design.
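As a rough illustration of this clustering view (not the paper's exact formulation; all names below are illustrative), cross-attention can be read as alternating a hard pixel-to-query assignment (an E-step) with a query update by cluster averaging (an M-step):

```python
import numpy as np

def kmeans_cross_attention(queries, pixel_feats, iters=3):
    """Alternate hard assignment (E-step) and center update (M-step)."""
    # queries: (N, D) cluster centers; pixel_feats: (HW, D) pixel features
    for _ in range(iters):
        affinity = pixel_feats @ queries.T              # (HW, N) similarities
        assign = np.argmax(affinity, axis=1)            # hard pixel-to-query assignment
        onehot = np.eye(queries.shape[0])[assign]       # (HW, N) assignment matrix
        counts = onehot.sum(axis=0, keepdims=True).T    # (N, 1) cluster sizes
        queries = (onehot.T @ pixel_feats) / np.maximum(counts, 1)  # cluster means
    return queries, assign
```

In the actual method, the hard assignment replaces the spatial softmax in cross-attention and the update is learned end-to-end; this sketch only shows the underlying k-means structure.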

TubeFormer-DeepLab: Video Mask Transformer

TubeFormer-DeepLab is presented, the first attempt to tackle multiple core video segmentation tasks in a unified manner. It directly predicts video tubes with task-specific labels, which not only significantly simplifies video segmentation models but also advances state-of-the-art results on multiple video segmentation benchmarks.

Panoramic Panoptic Segmentation: Insights Into Surrounding Parsing for Mobile Agents via Unsupervised Contrastive Learning

This work introduces panoramic panoptic segmentation as the most holistic form of scene understanding, both in terms of Field of View (FoV) and image-level understanding for standard camera-based input, and proposes a framework that allows model training on standard pinhole images and transfers the learned features to a different domain in a cost-minimizing way.

DETRs with Hybrid Matching

A simple yet effective method is proposed based on a hybrid matching scheme that combines the original one-to-one matching branch with auxiliary queries that use a one-to-many matching loss during training, improving both training efficiency and accuracy.
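A minimal sketch of the two matching branches, assuming a precomputed query-to-ground-truth cost matrix (function names are illustrative, and the brute-force matcher stands in for the Hungarian algorithm used in practice):

```python
import itertools
import numpy as np

def one_to_one_match(cost):
    """Exhaustive minimum-cost bijection (stand-in for Hungarian matching)."""
    n = cost.shape[0]
    return list(min(itertools.permutations(range(n)),
                    key=lambda p: sum(cost[i, p[i]] for i in range(n))))

def one_to_many_match(cost, k=2):
    """Auxiliary branch: each ground truth is matched to its k cheapest queries."""
    return {g: [int(i) for i in np.argsort(cost[:, g])[:k]]
            for g in range(cost.shape[1])}
```

During training, the one-to-one branch preserves the end-to-end, NMS-free property while the one-to-many branch supplies denser supervision; at inference only the one-to-one branch is used.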



Fully Convolutional Networks for Panoptic Segmentation

This approach aims to represent and predict foreground things and background stuff in a unified fully convolutional pipeline, and outperforms previous box-based and box-free models with high efficiency on the COCO, Cityscapes, and Mapillary Vistas datasets with single-scale input.

UPSNet: A Unified Panoptic Segmentation Network

A parameter-free panoptic head is introduced that solves panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation to enable prediction of an extra unknown class, which helps better resolve conflicts between semantic and instance segmentation.
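A toy sketch of combining semantic and instance logits with an extra unknown channel for pixel-wise panoptic classification (a simplification for illustration; the exact unknown-logit construction in UPSNet differs and uses thing-class logits only):

```python
import numpy as np

def panoptic_head(sem_logits, inst_logits):
    """Pixel-wise panoptic classification over stacked logits plus an unknown channel."""
    # sem_logits: (C_stuff, H, W); inst_logits: (N_inst, H, W)
    # Unknown channel is high where semantics are confident but no instance
    # claims the pixel (simplified relative to the paper's formulation).
    unknown = (sem_logits.max(axis=0) - inst_logits.max(axis=0))[None]
    logits = np.concatenate([sem_logits, inst_logits, unknown], axis=0)
    return logits.argmax(axis=0)   # per-pixel panoptic label index
```

Because the head is just stacking and argmax over existing logits, it adds no parameters, which is the property the abstract highlights.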


Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation

The single Panoptic-DeepLab sets a new state of the art on all three Cityscapes benchmarks, reaching 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set, and advances results on the challenging Mapillary Vistas dataset.

Panoptic Feature Pyramid Networks

This work endows Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone, and shows it is a robust and accurate baseline for both tasks.

Seamless Scene Segmentation

This work introduces a novel, CNN-based architecture that can be trained end-to-end to deliver seamless scene segmentation results by means of a panoptic output format, going beyond the simple combination of independently trained segmentation and detection models.

An End-To-End Network for Panoptic Segmentation

A novel end-to-end Occlusion Aware Network (OANet) for panoptic segmentation is proposed, which can efficiently and effectively predict both the instance and stuff segmentation in a single network and introduces a novel spatial ranking module to deal with the occlusion problem between the predicted instances.

SOLOv2: Dynamic and Fast Instance Segmentation

State-of-the-art results in object detection (from the authors' mask byproduct) and panoptic segmentation show the potential of SOLOv2 to serve as a new strong baseline for many instance-level recognition tasks beyond instance segmentation.

Unifying Training and Inference for Panoptic Segmentation

An end-to-end network is presented to bridge the gap between the training and inference pipelines for panoptic segmentation, a task that seeks to partition an image into semantic regions for "stuff" and object instances for "things", and which is of interest for applications with limited computation budgets.

Segmenter: Transformer for Semantic Segmentation

This paper introduces Segmenter, a transformer model for semantic segmentation that outperforms the state of the art on both ADE20K and Pascal Context datasets and is competitive on Cityscapes.

Rethinking Atrous Convolution for Semantic Image Segmentation

The proposed `DeepLabv3' system significantly improves over previous DeepLab versions without DenseCRF post-processing and attains performance comparable to other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.