GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose

@article{Yin2018GeoNetUL,
  title={GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose},
  author={Zhichao Yin and Jianping Shi},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={1983--1992}
}
  • Zhichao Yin, Jianping Shi
  • Published 6 March 2018
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
We propose GeoNet, a jointly unsupervised learning framework for monocular depth, optical flow and egomotion estimation from videos. The three components are coupled by the nature of 3D scene geometry, jointly learned by our framework in an end-to-end manner. Specifically, geometric relationships are extracted over the predictions of individual modules and then combined as an image reconstruction loss, reasoning about static and dynamic scene parts separately. Furthermore, we propose an… 

Monocular Visual Odometry based on joint unsupervised learning of depth and optical flow with geometric constraints

This work mitigates scale drift, which can degrade performance on long sequences, by incorporating standard epipolar geometry into the framework: correspondences are extracted from the predicted optical flow and then used to recover ego-motion.

Beyond Photometric Loss for Self-Supervised Ego-Motion Estimation

This paper bridges the gap between geometric loss and photometric loss by introducing the matching loss constrained by epipolar geometry in a self-supervised framework and outperforms the state-of-the-art unsupervised egomotion estimation methods by a large margin.

Unsupervised Learning of Depth, Camera Pose and Optical Flow from Monocular Video

Evaluation on the KITTI and Cityscapes driving datasets reveals that the proposed DFPNet model achieves results comparable to the state of the art in all three tasks, even with a significantly smaller model size.

Un-VDNet: unsupervised network for visual odometry and depth estimation

The proposed Un-VDNet, based on unsupervised convolutional neural networks to predict camera ego-motion and depth maps from image sequences, outperforms the state-of-the-art methods for visual odometry and depth estimation in dealing with dynamic objects of outdoor and indoor scenes.

USegScene: Unsupervised Learning of Depth, Optical Flow and Ego-Motion with Semantic Guidance and Coupled Networks

This paper proposes USegScene, a framework for semantically guided unsupervised learning of depth, optical flow and ego-motion estimation for stereo camera images using convolutional neural networks and presents results on the popular KITTI dataset.

Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation

The proposed SceneNet model performs region-aware depth estimation by enforcing semantic consistency between stereo pairs and produces favorable results against state-of-the-art approaches.

Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss

DPCNN uses the triangulation principle to establish a two-channel depth consistency loss, which penalizes inconsistency of the depths estimated from the spatial images and inconsecutive temporal images, respectively.

Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation

This paper proposes a semantics-driven unsupervised learning approach for monocular depth and ego-motion estimation from videos that exploits semantic segmentation information to mitigate the effects of dynamic objects and occlusions in the scene, and to improve depth prediction performance by considering the correlation between depth and semantics.

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

This paper proposes a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions, and is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
...

References

Unsupervised Learning of Depth and Ego-Motion from Video

Empirical evaluation demonstrates the effectiveness of the unsupervised learning framework: monocular depth estimation performs comparably with supervised methods that use either ground-truth pose or depth for training, and pose estimation performs favorably compared to established SLAM systems under comparable input settings.

DeMoN: Depth and Motion Network for Learning Monocular Stereo

This work trains a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs, and in contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and better generalizes to structures not seen during training.

3D Scene Flow Estimation with a Piecewise Rigid Scene Model

This work proposes to represent the dynamic scene as a collection of rigidly moving planes, into which the input images are segmented, and shows that a view-consistent multi-frame scheme significantly improves accuracy, especially in the presence of occlusions, and increases robustness against adverse imaging conditions.

Geometric Loss Functions for Camera Pose Regression with Deep Learning

  • Alex Kendall, R. Cipolla
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
A number of novel loss functions for learning camera pose which are based on geometry and scene reprojection error are explored, and it is shown how to automatically learn an optimal weighting to simultaneously regress position and orientation.

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally, and applies a scale-invariant error to help measure depth relations rather than scale.

Geometry-Aware Learning of Maps for Camera Localization

This work proposes to represent maps as a deep neural net called MapNet, which enables learning a data-driven map representation and proposes a novel parameterization for camera rotation which is better suited for deep-learning based camera pose regression.

PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

This work trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation, demonstrating that convnets can be used to solve complicated out of image plane regression problems.

Object scene flow for autonomous vehicles

A novel model and dataset for 3D scene flow estimation, with an application to autonomous driving, are proposed; the model represents each element in the scene by its rigid motion parameters and each superpixel by a 3D plane together with an index to the corresponding object.

Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios?

The importance of recognition granularity is investigated, from coarse 2D bounding box estimates over 2D instance segmentations to fine-grained 3D object part predictions, and it is observed that the instance segmentation cue is by far the strongest in the authors' setting.

Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness

An unsupervised approach to train a convnet end-to-end for predicting optical flow between two images using a loss function that combines a data term that measures photometric constancy over time with a spatial term that models the expected variation of flow across the image.
...