Recurrent Scene Parsing with Perspective Understanding in the Loop

Shu Kong and Charless C. Fowlkes. "Recurrent Scene Parsing with Perspective Understanding in the Loop." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Objects may appear at arbitrary scales in perspective images of a scene, posing a challenge for recognition systems that process images at a fixed resolution. We propose a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale (inversely proportional to the depth) so that small details are preserved for distant objects while larger receptive fields are used for those nearby. The depth gating signal is… 
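
As a rough illustration of the depth-aware gating idea, the sketch below maps per-pixel depth to a pooling field size and average-pools each location accordingly, so distant pixels keep small windows while nearby ones get large receptive fields. The function names and the candidate kernel sizes are my own illustrative choices, not taken from the paper:

```python
import numpy as np

def depth_to_pool_size(depth, sizes=(1, 3, 5, 7)):
    """Pick a pooling field per pixel: far pixels (large depth) keep a small
    window so fine detail survives; near pixels (small depth) get a large one.
    `sizes` is an illustrative set of candidate kernel sizes."""
    # Normalize depth to [0, 1]; invert so that near -> large size index.
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
    idx = np.round((1.0 - d) * (len(sizes) - 1)).astype(int)
    return np.take(sizes, idx)

def gated_avg_pool(feat, depth, sizes=(1, 3, 5, 7)):
    """Average-pool `feat` with a per-pixel window chosen from `depth`."""
    H, W = feat.shape
    ks = depth_to_pool_size(depth, sizes)
    out = np.empty_like(feat, dtype=float)
    for i in range(H):
        for j in range(W):
            r = ks[i, j] // 2
            patch = feat[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = patch.mean()
    return out
```

In the actual architecture the gating signal is predicted (or given) as a depth map and selects among pooled branches inside the network; the loop above is only the conceptual per-pixel selection.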


Pixel-Wise Attentional Gating for Scene Parsing

This work extensively evaluates Pixel-wise Attentional Gating (PAG) on a variety of per-pixel labeling tasks, including semantic segmentation, boundary detection, and monocular depth and surface-normal estimation, and demonstrates that PAG achieves competitive or state-of-the-art performance on these tasks.

SPGNet: Semantic Prediction Guidance for Scene Parsing

By carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with the proposed Semantic Prediction Guidance (SPG) module can significantly outperform its one-stage counterpart with similar parameters and computations.

Rethinking Atrous Convolution for Semantic Image Segmentation

The proposed DeepLabv3 system significantly improves over previous DeepLab versions without DenseCRF post-processing and attains performance comparable to other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
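
Atrous (dilated) convolution spaces the kernel taps apart to enlarge the receptive field without adding parameters. A minimal 1-D sketch (my own simplification; real networks apply this in 2-D over feature channels):

```python
import numpy as np

def atrous_conv1d(x, w, rate=1):
    """1-D atrous (dilated) convolution with 'valid' padding: the taps of
    `w` are spaced `rate` apart, so the effective receptive field grows
    from len(w) to (len(w) - 1) * rate + 1 at no extra parameter cost."""
    k = len(w)
    span = (k - 1) * rate + 1          # effective receptive field
    n = len(x) - span + 1
    return np.array([sum(w[j] * x[i + j * rate] for j in range(k))
                     for i in range(n)])
```

With `rate=1` this reduces to an ordinary convolution; larger rates see a wider context with the same three weights.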

Encoder–decoder with double spatial pyramid for semantic segmentation

A network with an encoder–decoder architecture based on two proposed modules: global pyramid attention module (GPAM) and pyramid decoder module (PDM) that achieves a mean intersection over union score of 83.4% on PASCAL VOC 2012 dataset and 78.5% on Cityscapes dataset.

Variational Context-Deformable ConvNets for Indoor Scene Parsing

A novel variational context-deformable (VCD) module learns adaptive receptive fields in a structured fashion, and a perspective-aware guidance module is designed to exploit multi-modal information for RGB-D segmentation.

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Novel deep dual-resolution networks (DDRNets) are proposed for real-time semantic segmentation of road scenes and a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) is designed to enlarge effective receptive fields and fuse multi-scale context.

Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing

This paper proposes a novel operator called malleable 2.5D convolution to learn the receptive field along the depth axis; it is formulated in a differentiable form so that it can be learned by gradient descent.

Gated Fully Fusion for Semantic Segmentation

This paper proposes a new architecture, named Gated Fully Fusion (GFF), to selectively fuse features from multiple levels using gates in a fully connected way, and achieves state-of-the-art results on four challenging scene parsing datasets: Cityscapes, Pascal Context, COCO-Stuff, and ADE20K.
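
A simplified reading of GFF's gated fusion rule, sketched below with my own function name and with plain arrays standing in for feature maps: each level keeps its own features weighted by its gate and borrows from other levels where its gate is low:

```python
import numpy as np

def gated_fully_fuse(feats, gates):
    """Simplified sketch of gated fully-connected fusion: level l receives
    (1 + g_l) * x_l + (1 - g_l) * sum_{m != l} g_m * x_m, so a confident
    level (g_l near 1) keeps its own features while an uncertain one
    (g_l near 0) borrows from the other levels' gated features."""
    L = len(feats)
    out = []
    for l in range(L):
        borrowed = sum(gates[m] * feats[m] for m in range(L) if m != l)
        out.append((1 + gates[l]) * feats[l] + (1 - gates[l]) * borrowed)
    return out
```

In the paper the gates are per-pixel sigmoid maps predicted from the features themselves; here they are passed in directly to keep the sketch small.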

Deeper Depth Prediction with Fully Convolutional Residual Networks

A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

This work addresses semantic image segmentation with deep learning, proposing atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
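
ASPP runs the same input through parallel atrous branches at several dilation rates and stacks the responses. A toy 1-D version (the rates follow DeepLab's defaults, but the 1-D setting, shared kernel, and function names are my own simplifications):

```python
import numpy as np

def dilated_conv1d_same(x, w, rate):
    """'Same'-padded 1-D dilated convolution (zero padding at the borders)."""
    k = len(w)
    pad = (k - 1) * rate // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * rate] for j in range(k))
                     for i in range(len(x))])

def aspp_1d(x, w, rates=(1, 6, 12, 18)):
    """Toy ASPP: one kernel applied in parallel at several dilation rates,
    with the per-rate responses stacked as channels."""
    return np.stack([dilated_conv1d_same(x, w, r) for r in rates])
```

Each channel of the output sees the signal at a different effective scale; a real ASPP head would learn a separate kernel per branch and concatenate before a final classifier.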

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally, and applies a scale-invariant error to help measure depth relations rather than scale.
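
The scale-invariant error from this paper compares depths in log space and subtracts the squared mean log-ratio, so a uniform multiplicative rescaling of the prediction does not change the error. A direct implementation (function name my own):

```python
import numpy as np

def scale_invariant_error(pred, gt, lam=1.0):
    """Scale-invariant log error: with lam = 1, multiplying `pred` by any
    positive constant shifts every d_i equally and leaves the error
    unchanged, so only relative depth relations are penalized."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / n ** 2
```

Setting `lam=0` recovers the ordinary mean squared log error, which does penalize global scale.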

Attention to Scale: Scale-Aware Semantic Image Segmentation

An attention mechanism is proposed that learns to softly weight the multi-scale features at each pixel location; it not only outperforms average- and max-pooling but also allows diagnostic visualization of the importance of features at different positions and scales.
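
The merge step amounts to a per-pixel softmax over the scale axis. A minimal sketch (my own function name; in the paper the attention logits are predicted by a small convolutional branch):

```python
import numpy as np

def attention_merge(score_maps, attn_logits):
    """Merge per-scale score maps of shape (S, H, W) with per-pixel softmax
    attention weights -- the soft alternative to averaging or max-pooling
    across scales."""
    a = np.exp(attn_logits - attn_logits.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)      # softmax over the scale axis
    return (a * score_maps).sum(axis=0)
```

Uniform logits recover plain averaging; a strongly peaked logit map approaches max-style selection, and the weights themselves can be visualized per pixel.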

Fully convolutional networks for semantic segmentation

The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.

Pyramid Scene Parsing Network

This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.
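
The pyramid pooling module averages features over coarser and coarser grids and stacks the broadcast results with the input. A toy single-channel version using PSPNet's 1/2/3/6 bin sizes (nearest-neighbour upsampling and the function name are my own simplifications; the real module also applies a 1x1 conv per branch):

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 3, 6)):
    """Toy pyramid pooling: average `feat` over b x b grids, broadcast each
    pooled map back to full resolution, and stack with the input."""
    H, W = feat.shape
    maps = [feat]
    for b in bins:
        pooled = np.empty_like(feat, dtype=float)
        for bi in range(b):
            for bj in range(b):
                r0, r1 = bi * H // b, (bi + 1) * H // b
                c0, c1 = bj * W // b, (bj + 1) * W // b
                pooled[r0:r1, c0:c1] = feat[r0:r1, c0:c1].mean()
        maps.append(pooled)
    return np.stack(maps)
```

The 1x1 bin carries a global scene prior, while the finer grids retain progressively more spatial context.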

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

A multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps is described.
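
One refinement step of that architecture, heavily simplified: upsample the coarse prediction and add the high-resolution skip features only where a gate (for example a predicted boundary mask) is on. Function names and the nearest-neighbour upsampling are my own stand-ins:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def refine(coarse, skip, gate):
    """One Laplacian-pyramid-style refinement step: the multiplicative gate
    restricts the high-res skip contribution to the regions where the
    coarse prediction needs boundary correction."""
    return upsample2(coarse) + gate * skip
```

Stacking such steps from the lowest-resolution prediction upward successively sharpens segment boundaries.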

Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation

This work shows how to improve semantic segmentation through the use of contextual information, specifically 'patch-patch' context between image regions and 'patch-background' context, and formulates Conditional Random Fields with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches.

Deep convolutional neural fields for depth estimation from a single image

A deep structured learning scheme learns the unary and pairwise potentials of a continuous CRF in a unified deep CNN framework and can be used for depth estimation of general scenes with no geometric priors or extra information injected.

FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture

This paper proposes an encoder-decoder type network, where the encoder part is composed of two branches of networks that simultaneously extract features from RGB and depth images and fuse depth features into the RGB feature maps as the network goes deeper.
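
The fusion scheme can be sketched as a two-branch loop in which the depth activations are added elementwise into the RGB stream after each encoder stage while the depth branch continues unfused. The per-stage scalar weights below stand in for conv layers, and the function names are my own:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fusenet_forward(rgb, depth, rgb_ws, depth_ws):
    """Sketch of FuseNet's two-branch encoder: after each (toy) layer the
    depth activations are folded into the RGB stream by elementwise
    addition, while the depth branch keeps its own unfused path."""
    for wr, wd in zip(rgb_ws, depth_ws):
        depth = relu(wd * depth)
        rgb = relu(wr * rgb) + depth   # fusion: depth added into RGB
    return rgb
```

Keeping the depth branch unfused preserves a clean depth representation at every stage, which is the design point the paper argues for over early or late concatenation.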