FoveaNet: Perspective-Aware Urban Scene Parsing

@inproceedings{Li2017FoveaNet,
  title={FoveaNet: Perspective-Aware Urban Scene Parsing},
  author={X. Li and Zequn Jie and Wei Wang and Changsong Liu and Jimei Yang and Xiaohui Shen and Zhe L. Lin and Qiang Chen and Shuicheng Yan and Jiashi Feng},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}
  • Published 8 August 2017
Parsing urban scene images benefits many applications, especially self-driving. Most current solutions employ generic image parsing models that treat all scales and locations in an image equally and ignore the geometric properties of car-captured urban scene images. They therefore suffer from the heterogeneous object scales caused by the camera's perspective projection of the actual scene, and inevitably encounter parsing failures on distant objects as well as other boundary and recognition…


VRT-Net: Real-Time Scene Parsing via Variable Resolution Transform
The proposed framework is designed as a wrapper over existing real-time scene parsing models, demonstrating a superior trade-off between speed and quality compared to the prior state of the art.
Perspective-Adaptive Convolutions for Scene Parsing
This work proposes perspective-adaptive convolutions that acquire receptive fields of flexible sizes and shapes during scene parsing by adding a new perspective regression layer, which dynamically infers position-adaptive perspective coefficient vectors used to reshape the convolutional patches.
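The core idea — a convolutional receptive field whose dilation varies with image position — can be sketched in a few lines. The rule below (dilation growing from an assumed horizon line toward the image bottom) and all names are illustrative stand-ins, not the paper's learned perspective regression layer:

```python
def dilation_for_row(row, height, horizon=0.4, d_min=1, d_max=4):
    """Hypothetical rule: objects near the horizon appear small, so use a
    small dilation there; objects near the bottom of a car-captured image
    appear large, so widen the receptive field."""
    t = max(0.0, (row / height - horizon) / (1.0 - horizon))  # 0 at horizon, 1 at bottom
    return round(d_min + t * (d_max - d_min))

def adaptive_patch(image, cy, cx, dilation):
    """Sample a 3x3 patch around (cy, cx) with the given dilation,
    clamping coordinates at the image border."""
    h, w = len(image), len(image[0])
    patch = []
    for dy in (-dilation, 0, dilation):
        row = []
        for dx in (-dilation, 0, dilation):
            y = min(max(cy + dy, 0), h - 1)
            x = min(max(cx + dx, 0), w - 1)
            row.append(image[y][x])
        patch.append(row)
    return patch
```

In the actual paper the coefficients are regressed per position rather than derived from a fixed horizon heuristic; the sketch only shows how a position-dependent dilation reshapes the sampled patch.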
Semantic Segmentation for Urban-Scene Images
This project seeks an advanced, integrated solution specifically targeting urban-scene semantic segmentation among the most novel approaches in the field, and finds that the two-step integrated model steadily improves the mean Intersection-over-Union (mIoU) score over the baseline model.
Cars Can’t Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks
  • Sungha Choi, J. Kim, J. Choo
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper exploits the intrinsic features of urban-scene images and proposes a general add-on module, called height-driven attention networks (HANet), for improving semantic segmentation of urban-scene images; it achieves a new state-of-the-art performance on the Cityscapes benchmark by a large margin among ResNet-101-based segmentation models.
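A minimal sketch of the height-driven attention idea — per-row channel gates computed from width-pooled features — assuming a single hand-written linear map (`w`, `b`) in place of HANet's learned layers:

```python
import math

def width_pool(feat):
    """Average-pool a C x H x W feature map over the width axis -> C x H."""
    return [[sum(row) / len(row) for row in channel] for channel in feat]

def height_driven_attention(feat, w, b):
    """Rescale each channel per image row. A tiny linear map (weights `w`,
    bias `b` -- stand-ins for HANet's learned layers) turns width-pooled
    features into sigmoid channel gates that differ row by row."""
    pooled = width_pool(feat)                      # C x H
    C, H = len(feat), len(feat[0])
    out = []
    for c in range(C):
        gated_rows = []
        for y in range(H):
            z = sum(w[c][k] * pooled[k][y] for k in range(C)) + b[c]
            gate = 1.0 / (1.0 + math.exp(-z))      # per-row, per-channel attention
            gated_rows.append([v * gate for v in feat[c][y]])
        out.append(gated_rows)
    return out
```

The point is that, unlike global channel attention, the gate depends on the vertical position — matching the observation that class distributions in driving imagery vary strongly with image height.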
Boosting Real-Time Driving Scene Parsing With Shared Semantics
It is demonstrated that sharing semantics between cameras with different perspectives and overlapping views can boost the parsing performance when compared with traditional methods, which individually process the frames from each camera.
Consensus Feature Network for Scene Parsing
The Consensus Feature Network (CFNet) is presented, based on the proposed ICT and CCT units, and achieves competitive performance on four datasets, including Cityscapes, Pascal Context, CamVid, and COCO Stuff.
FBNet: Feature Balance Network for Urban-Scene Segmentation
This paper presents a novel add-on module, named Feature Balance Network (FBNet), to eliminate feature camouflage in urban-scene segmentation, and achieves new state-of-the-art segmentation performance on two challenging urban-scene benchmarks, i.e., Cityscapes and BDD100K.
Learning a Layout Transfer Network for Context Aware Object Detection
This work presents a context-aware object detection method based on a retrieve-and-transform scene layout model that provides consistent performance improvements over state-of-the-art object detection baselines on a variety of challenging tasks in the traffic-surveillance and autonomous-driving domains.
Small Object Sensitive Segmentation of Urban Street Scene With Spatial Adjacency Between Object Classes
This paper proposes a new boundary-based metric that measures the level of spatial adjacency between each pair of object classes, finds that this metric is robust against object-size-induced biases, and develops a new method to incorporate this metric into the segmentation loss.


The Cityscapes Dataset for Semantic Urban Scene Understanding
This work introduces Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling, and exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity.
Combining Appearance and Structure from Motion Features for Road Scene Understanding
A framework for pixel-wise object segmentation of road scenes is presented that combines motion and appearance features and is designed to handle street-level imagery such as that found in Google Street View and Microsoft Bing Maps.
Putting Objects in Perspective
This paper provides a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface orientations, and camera viewpoint by allowing probabilistic object hypotheses to refine geometry and vice-versa.
Learning Hierarchical Features for Scene Labeling
A method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel, alleviates the need for engineered features, and produces a powerful representation that captures texture, shape, and contextual information.
Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation
This work shows how to improve semantic segmentation through the use of contextual information, specifically 'patch-patch' context between image regions and 'patch-background' context, and formulates Conditional Random Fields with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches.
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally, and applies a scale-invariant error to help measure depth relations rather than scale.
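The scale-invariant error introduced by Eigen et al. can be written directly: with d_i = log(pred_i) − log(target_i), the error is D = (1/n) Σ d_i² − (λ/n²) (Σ d_i)², where λ = 1 makes it fully invariant to a global rescaling of the predictions. A small sketch:

```python
import math

def scale_invariant_error(pred, target, lam=1.0):
    """Scale-invariant log error (Eigen et al.): with
    d_i = log(pred_i) - log(target_i),
    D = (1/n) * sum(d_i^2) - (lam / n^2) * (sum(d_i))^2.
    With lam = 1, multiplying all predictions by a constant
    leaves the error unchanged -- only depth *relations* count."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / n ** 2
```

For example, doubling every predicted depth shifts each d_i by log 2, and the two terms cancel the shift exactly, so the error is unchanged.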
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
This work addresses semantic image segmentation with deep learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, improving the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
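The ASPP idea — parallel atrous (dilated) convolutions at several rates whose outputs are fused — can be sketched with a toy single-channel implementation. The sum-fusion follows DeepLab v2's score-map summation; the kernels and rates here are illustrative placeholders, not the network's learned filters:

```python
def atrous_conv2d(image, kernel, rate):
    """Single-channel 3x3 atrous (dilated) convolution with zero padding:
    kernel taps are spaced `rate` pixels apart, enlarging the receptive
    field to (2 * rate + 1) per side without adding parameters."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in (-1, 0, 1):
                for kx in (-1, 0, 1):
                    yy, xx = y + ky * rate, x + kx * rate
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[ky + 1][kx + 1] * image[yy][xx]
            out[y][x] = acc
    return out

def aspp(image, kernels, rates):
    """Toy ASPP head: run parallel atrous branches at different rates and
    fuse them by element-wise summation, so each output pixel sees the
    scene at several scales at once."""
    h, w = len(image), len(image[0])
    fused = [[0.0] * w for _ in range(h)]
    for k, r in zip(kernels, rates):
        branch = atrous_conv2d(image, k, r)
        for y in range(h):
            for x in range(w):
                fused[y][x] += branch[y][x]
    return fused
```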
Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net
This work proposes a "Hierarchical Auto-Zoom Net" (HAZN) for object part parsing that adapts to the local scales of objects and parts, significantly outperforms the prior state of the art by 5% mIoU, and is especially better at segmenting small instances and small parts.
DAG-Recurrent Neural Networks for Scene Labeling
Directed acyclic graph RNNs are proposed to process DAG-structured images, enabling the network to model long-range semantic dependencies among image units, together with a novel class weighting function that attends to rare classes and markedly boosts the recognition accuracy of infrequent classes.