Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints

@inproceedings{Mahjourian2018UnsupervisedLO,
  title={Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints},
  author={Reza Mahjourian and Martin Wicke and Anelia Angelova},
  booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={5667-5675}
}
We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion ground truth, or multi-view video). Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only consider pixels in small local neighborhoods. Our main contribution is to explicitly consider the inferred 3D geometry of the whole scene, and enforce consistency of the… 

Unsupervised Learning of Depth and Ego-Motion from Monocular Video in 3D

A loss is proposed that penalizes inconsistencies in the estimated depth by directly comparing 3D point clouds in a common reference frame, optimized via a novel (approximate) backpropagation algorithm for aligning 3D structures.
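As an illustration of this kind of 3D alignment penalty, a minimal NumPy sketch is given below. The function names, the pinhole back-projection, and the brute-force nearest-neighbour search are our own simplifications for clarity, not the paper's implementation (which uses a differentiable ICP approximation and a KD-tree-free formulation inside a training graph):

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to a 3D point cloud using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                      # 3 x N viewing rays
    return (rays * depth.reshape(1, -1)).T                             # N x 3 points

def icp_alignment_loss(points_a, points_b, R, t):
    """Mean nearest-neighbour residual between cloud A, moved into B's frame
    by the estimated ego-motion (R, t), and cloud B."""
    moved = points_a @ R.T + t
    # Brute-force nearest neighbours for clarity; a KD-tree would be used at scale.
    d2 = ((moved[:, None, :] - points_b[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)).mean()
```

With a perfect depth map and ego-motion estimate, the two clouds coincide and the loss is zero; residuals grow as either estimate drifts, which is what makes the alignment usable as a training signal.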

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

This paper proposes a geometry consistency loss for scale-consistent predictions, together with an induced self-discovered mask for handling moving objects and occlusions. It is the first work to show that deep networks trained on unlabelled monocular videos can predict globally scale-consistent camera trajectories over long video sequences.
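A sketch of this style of geometry-consistency loss and its induced mask follows; the exact normalization matches the form commonly described for this line of work, but the function name and API are our own illustration:

```python
import numpy as np

def geometry_consistency(d_warped, d_target, eps=1e-6):
    """Per-pixel depth inconsistency and the induced validity mask.
    d_warped: source-frame depth warped into the target view via the
    estimated ego-motion; d_target: depth predicted for the target frame."""
    diff = np.abs(d_warped - d_target) / (d_warped + d_target + eps)  # in [0, 1)
    mask = 1.0 - diff          # low weight where depths disagree (moving objects, occlusions)
    loss = diff.mean()         # penalizes scale drift between adjacent predictions
    return loss, mask
```

Because the inconsistency is normalized by the sum of the two depths, the loss is bounded and symmetric, and the mask falls toward zero exactly where the static-scene assumption breaks down.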

Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding

Experiments on the KITTI 2015 dataset show that the estimated geometry, 3D motion, and moving-object masks are not only mutually consistent but also significantly outperform other state-of-the-art algorithms, demonstrating the benefits of the approach.

Unsupervised Learning of Monocular Depth and Ego-Motion Using Multiple Masks

A new unsupervised method for learning depth and ego-motion from monocular video using multiple masks is proposed. The masks account for pixels occluded when adjacent frames are projected onto each other, as well as for the blank regions that appear in the projection target's imaging plane.

Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment

This paper proposes to jointly optimize scene depth and camera motion by incorporating a differentiable Bundle Adjustment (BA) layer that minimizes the feature-metric error, and then forms a photometric consistency loss through view synthesis as the final supervisory signal.

Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras

This work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale.

Unsupervised Video Depth Estimation Based on Ego-motion and Disparity Consensus

A novel unsupervised monocular video depth estimation method for natural scenes is proposed, building on the state-of-the-art method of Zhou et al., which jointly estimates depth and camera motion.

Self-Supervised Learning of Depth and Ego-Motion from Video by Alternative Training and Geometric Constraints from 3D to 2D

This paper aims to improve depth-pose learning performance without auxiliary tasks, addressing the above issues by alternately training each task and by incorporating epipolar geometric constraints into the Iterative Closest Point (ICP)-based point-cloud matching process.

Motion Rectification Network for Unsupervised Learning of Monocular Depth and Camera Motion

A novel framework for unsupervised learning of monocular depth and camera-motion estimation, applicable to dynamic scenes, is proposed. The model is first trained to obtain initial inference results under a static-scene assumption, then fine-tuned by joint learning with a motion rectification network.

Monocular Visual Odometry based on joint unsupervised learning of depth and optical flow with geometric constraints

This work mitigates the scale-drift issue, which can degrade performance on long sequences, by incorporating standard epipolar geometry into the framework: correspondences are extracted from the predicted optical flow and then used to recover ego-motion.
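The epipolar constraint these methods exploit can be checked with a few lines of NumPy. The sketch below (our own illustration, not this paper's pipeline) evaluates the algebraic epipolar error for correspondences in normalized camera coordinates, given an essential matrix built from a known rotation and translation:

```python
import numpy as np

def skew(t):
    """Skew-symmetric cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_residuals(E, x1, x2):
    """Algebraic epipolar error |x2^T E x1| for N x 3 homogeneous
    correspondences x1 (first view) and x2 (second view)."""
    return np.abs(np.einsum('ni,ij,nj->n', x2, E, x1))
```

For ideal correspondences of a static point, the residual is zero; in a flow-based odometry pipeline, such residuals serve as a geometric filter or loss on the matches recovered from predicted optical flow.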
...

References

Showing 1-10 of 35 references

Unsupervised Learning of Depth and Ego-Motion from Video

Empirical evaluation demonstrates the effectiveness of the unsupervised learning framework: monocular depth estimation performs comparably with supervised methods that use either ground-truth pose or depth for training, and pose estimation performs favorably compared to established SLAM systems under comparable input settings.

SfM-Net: Learning of Structure and Motion from Video

A geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations, which often successfully segments the moving objects in the scene.

Unsupervised Monocular Depth Estimation with Left-Right Consistency

This paper proposes a novel training objective that enables the convolutional neural network to learn to perform single-image depth estimation despite the absence of ground-truth depth data, and produces state-of-the-art results for monocular depth estimation on the KITTI driving dataset.

DeMoN: Depth and Motion Network for Learning Monocular Stereo

This work trains a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs, and in contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and better generalizes to structures not seen during training.

Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs

This paper tackles the challenging and essentially underdetermined problem of depth and surface-normal estimation by regression on deep convolutional neural network (DCNN) features, combined with a post-processing refinement step using conditional random fields (CRFs).

Dense Monocular Depth Estimation in Complex Dynamic Scenes

A novel motion segmentation algorithm is provided that segments the optical flow field into a set of motion models, each with its own epipolar geometry, and it is shown that the scene can be reconstructed based on these motion models by optimizing a convex program.

Deeper Depth Prediction with Fully Convolutional Residual Networks

A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.

Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks

By performing depth classification instead of regression, this paper can easily obtain the confidence of a depth prediction in the form of a probability distribution. An information-gain loss is applied to exploit predictions close to the ground truth during training, and fully connected conditional random fields are used for post-processing to further improve performance.

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally, and applies a scale-invariant error to help measure depth relations rather than scale.
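The scale-invariant error this paper introduces has a compact closed form; a sketch follows, with lam=0.5 as commonly reported for the training loss (the fully scale-invariant metric corresponds to lam=1.0, and either value should be confirmed against the paper):

```python
import numpy as np

def scale_invariant_error(pred, gt, lam=0.5):
    """Scale-invariant log-depth error: at lam=1 a global scaling of the
    prediction cancels out, so the error measures depth *relations*
    rather than absolute magnitude."""
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - lam * d.mean() ** 2
```

Subtracting the squared mean of the log-ratio is what removes the global-scale term: a prediction off by a constant factor has constant d, which the lam=1 correction cancels exactly.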

Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

This work proposes an unsupervised framework to learn a deep convolutional neural network for single-view depth prediction without requiring a pre-training stage or annotated ground-truth depths, and shows that the network, trained on less than half of the KITTI dataset, gives performance comparable to state-of-the-art supervised methods for single-view depth estimation.