Geometry meets semantics for semi-supervised monocular depth estimation

@inproceedings{Ramirez2018GeometryMS,
  title={Geometry meets semantics for semi-supervised monocular depth estimation},
  author={Pierluigi Zama Ramirez and Matteo Poggi and Fabio Tosi and Stefano Mattoccia and Luigi di Stefano},
  booktitle={ACCV},
  year={2018}
}
Depth estimation from a single image is a very exciting challenge in computer vision. While other image-based depth sensing techniques leverage the geometry between different viewpoints (e.g., stereo or structure from motion), the lack of such cues within a single image renders the monocular depth estimation task ill-posed. At inference time, state-of-the-art encoder-decoder architectures for monocular depth estimation rely on effective feature representations learned at training time… 

Transferring knowledge from monocular completion for self-supervised monocular depth estimation

TLDR
This paper proposes a novel framework utilizing monocular depth completion as an auxiliary task to assist monocular depth estimation, and achieves superior performance to state-of-the-art self-supervised methods and performance comparable to supervised methods.

Semi-Supervised Monocular Depth Estimation Based on Semantic Supervision

TLDR
Experiments on the KITTI dataset show that learning semantic information from images can effectively improve monocular depth estimation, and that SE-Net surpasses the most advanced methods in depth estimation accuracy.

Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge

TLDR
This paper proposes monoResMatch, a novel deep architecture that infers depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, and performing stereo matching between the two cues; it also shows how proxy ground-truth annotations obtained through traditional stereo algorithms enable more accurate monocular depth estimation.

Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment

TLDR
This paper proposes to jointly optimize scene depth and camera motion by incorporating a differentiable Bundle Adjustment (BA) layer that minimizes the feature-metric error, and then forms the photometric consistency loss with view synthesis as the final supervisory signal.

Learning Depth via Leveraging Semantics: Self-supervised Monocular Depth Estimation with Both Implicit and Explicit Semantic Guidance

TLDR
A Semantic-aware Spatial Feature Alignment (SSFA) scheme to effectively align implicit semantic features with depth features for scene-aware depth estimation, and a semantic-guided ranking loss to explicitly constrain the estimated depth maps to be consistent with real scene contextual properties, are proposed.

Distilled Semantics for Comprehensive Scene Understanding from Videos

TLDR
This paper addresses the three tasks jointly through a novel training protocol based on knowledge distillation and self-supervision, and a compact network architecture that enables efficient scene understanding on both power-hungry GPUs and low-power embedded platforms.

Self-supervised Monocular Depth Estimation with Semantic-aware Depth Features

TLDR
This paper introduces multi-task learning schemes to incorporate semantic-awareness into the representation of depth features and proposes SAFENet that is designed to leverage semantic information to overcome the limitations of the photometric loss.

S3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data

TLDR
This work presents S3Net, a self-supervised framework that combines synthetic and real-world images for training while exploiting geometric, temporal, and semantic constraints, and provides a new state of the art in self-supervised depth estimation using monocular videos.

Monocular depth estimation based on deep learning: An overview

TLDR
This review surveys current deep learning-based monocular depth estimation methods according to their different training manners, and summarizes several widely used datasets and evaluation metrics in deep learning-based depth estimation.

Generative Adversarial Networks for Unsupervised Monocular Depth Prediction

TLDR
This proposal is the first successful attempt to tackle monocular depth estimation with a GAN paradigm, and the extensive evaluation on the CityScapes and KITTI datasets confirms that it improves on traditional approaches.
...

References

SHOWING 1-10 OF 47 REFERENCES

Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions

TLDR
This paper introduces a novel interleaved training procedure that enforces the trinocular assumption on current binocular datasets, and outperforms state-of-the-art methods for unsupervised monocular depth estimation trained on binocular stereo pairs, as well as known methods relying on other cues.

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints

TLDR
The main contribution is to explicitly consider the inferred 3D geometry of the whole scene and enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames, outperforming the state of the art for both depth and ego-motion.

Unsupervised Monocular Depth Estimation with Left-Right Consistency

TLDR
This paper proposes a novel training objective that enables a convolutional neural network to learn to perform single-image depth estimation, despite the absence of ground-truth depth data, and produces state-of-the-art results for monocular depth estimation on the KITTI driving dataset.
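As a rough illustration of the left-right consistency idea summarized above, the loss penalizes disagreement between the left disparity map and the right disparity map sampled at the disparity-shifted location. The sketch below is a minimal NumPy approximation with nearest-neighbour sampling, not the paper's implementation (which uses differentiable bilinear sampling inside the network); the function name and clamping behaviour are illustrative assumptions.

```python
import numpy as np

def lr_consistency_loss(disp_left, disp_right):
    """Mean |d_l(x) - d_r(x - d_l(x))| over a pair of disparity maps.

    disp_left, disp_right: (H, W) arrays of horizontal disparities in pixels.
    Nearest-neighbour sketch of the left-right consistency term; sample
    coordinates are rounded and clamped to stay inside the image.
    """
    h, w = disp_left.shape
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    # Sample the right disparity map at x - d_l(x), clamped to the image width.
    sample = np.clip(np.rint(cols - disp_left).astype(int), 0, w - 1)
    projected = disp_right[rows, sample]
    return float(np.abs(disp_left - projected).mean())
```

When the two disparity maps describe the same geometry, the projected right disparity matches the left one and the loss vanishes; any mismatch contributes its absolute difference.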

Semi-Supervised Deep Learning for Monocular Depth Map Prediction

TLDR
This paper proposes a novel approach to depth map prediction from monocular images that learns in a semi-supervised way: it uses sparse ground-truth depth for supervised learning, and also constrains the deep network to produce photo-consistent dense depth maps in a stereo setup using a direct image alignment loss.

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

TLDR
This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally, and applies a scale-invariant error to help measure depth relations rather than scale.
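The scale-invariant error mentioned above is commonly written as \(\frac{1}{n}\sum_i d_i^2 - \frac{1}{n^2}(\sum_i d_i)^2\) with \(d_i = \log y_i - \log y_i^*\), which is invariant to a global scaling of the prediction (scaling shifts every \(d_i\) by the same constant). A minimal NumPy sketch of that formula, with an assumed `eps` guard for numerical safety that is not part of the original definition:

```python
import numpy as np

def scale_invariant_error(pred, gt, eps=1e-8):
    """Scale-invariant log error between predicted and ground-truth depths.

    Computes mean(d^2) - (sum(d))^2 / n^2 with d = log(pred) - log(gt),
    so multiplying pred by any positive constant leaves the error unchanged.
    """
    d = np.log(pred + eps) - np.log(gt + eps)
    n = d.size
    return float((d ** 2).mean() - (d.sum() ** 2) / (n * n))
```

For example, a prediction that is exactly twice the ground truth everywhere incurs (essentially) zero error, since only relative depth relations are measured.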

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

TLDR
An adaptive geometric consistency loss is proposed to increase robustness towards outliers and non-Lambertian regions, which resolves occlusions and texture ambiguities effectively and achieves state-of-the-art results in all of the three tasks, performing better than previously unsupervised methods and comparably with supervised ones.

Towards Real-Time Unsupervised Monocular Depth Estimation on CPU

TLDR
This paper proposes a novel architecture capable of quickly inferring an accurate depth map on a CPU, even of an embedded system, using a pyramid of features extracted from a single input image; it is the first method to achieve such performance on CPUs, paving the way for effective deployment of unsupervised monocular depth estimation even on embedded systems.

Generative Adversarial Networks for Unsupervised Monocular Depth Prediction

TLDR
This proposal is the first successful attempt to tackle monocular depth estimation with a GAN paradigm, and the extensive evaluation on the CityScapes and KITTI datasets confirms that it improves on traditional approaches.

Deeper Depth Prediction with Fully Convolutional Residual Networks

TLDR
A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.

Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

TLDR
This work proposes an unsupervised framework to learn a deep convolutional neural network for single-view depth prediction, without requiring a pre-training stage or annotated ground-truth depths, and shows that this network, trained on less than half of the KITTI dataset, gives performance comparable to state-of-the-art supervised methods for single-view depth estimation.