DeMoN: Depth and Motion Network for Learning Monocular Stereo

@inproceedings{Ummenhofer2017DeMoNDA,
  title={DeMoN: Depth and Motion Network for Learning Monocular Stereo},
  author={Benjamin Ummenhofer and Huizhong Zhou and Jonas Uhrig and Nikolaus Mayer and Eddy Ilg and Alexey Dosovitskiy and Thomas Brox},
  booktitle={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={5622--5631}
}
In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the… 
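The bootstrap-then-refine control flow described in the abstract (an initial encoder-decoder estimate, then an iterative network that improves its own prediction) can be sketched in plain Python. The two "networks" below are toy stand-ins (a zero initial estimate and a halfway update toward a target), not the paper's architecture; only the iterative-refinement structure mirrors DeMoN.

```python
import numpy as np

def bootstrap_net(image_pair):
    # Stand-in for the first encoder-decoder: returns an all-zero
    # "coarse" per-pixel estimate with the frames' spatial shape.
    return np.zeros(image_pair.shape[1:])

def iterative_net(image_pair, prev_estimate):
    # Stand-in refinement: move the estimate halfway toward a target
    # derived from the images (the mean frame plays the role of the
    # quantity a real network would regress).
    target = image_pair.mean(axis=0)
    return prev_estimate + 0.5 * (target - prev_estimate)

rng = np.random.default_rng(0)
pair = rng.standard_normal((2, 8, 8))   # two toy 8x8 "frames"

estimate = bootstrap_net(pair)
for _ in range(3):                      # the iterative net refines its own output
    estimate = iterative_net(pair, estimate)
```

Each pass halves the residual to the target, so repeated application of the same refinement network steadily improves the prediction; this is the self-improvement property the abstract attributes to the iterative core.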
Geometric Correspondence Network for Camera Motion Estimation
TLDR
A convolutional neural network and a recurrent neural network are trained together, in one unified structure, to detect keypoint locations and to generate corresponding descriptors for use in visual odometry.
Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss
TLDR
DPCNN uses the triangulation principle to establish a two-channel depth-consistency loss that penalizes inconsistency between the depths estimated from the spatial images and from inconsecutive temporal images.
Flow-Motion and Depth Network for Monocular Stereo and Beyond
TLDR
A learning-based method is proposed that solves monocular stereo and can be extended to fuse depth information from multiple target frames; compared with previous methods, it achieves state-of-the-art results in less time.
ENG: End-to-end Neural Geometry for Robust Depth and Pose Estimation using CNNs
TLDR
This work presents a framework that achieves state-of-the-art performance on single-image depth prediction for both indoor and outdoor scenes, outperforms previous deep-learning-based motion prediction approaches, and demonstrates that state-of-the-art metric depths can be further improved using knowledge of the pose.
UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation
TLDR
A deep-network-based unsupervised visual odometry system is presented that estimates the 6-DoF camera pose and a dense depth map for its monocular view, and it is shown to outperform existing state-of-the-art methods in depth and ego-motion estimation.
Unsupervised Learning of Depth and Ego-Motion from Video
TLDR
Empirical evaluation demonstrates that the unsupervised framework's monocular depth estimation performs comparably with supervised methods that use either ground-truth pose or depth for training, and that its pose estimation performs favorably compared to established SLAM systems under comparable input settings.
Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video
TLDR
This paper improves the joint self-supervised method in three aspects: network structure, dynamic-object segmentation, and geometric constraints; it achieves state-of-the-art performance in pose and optical-flow estimation, along with competitive results in depth estimation.
Un-VDNet: unsupervised network for visual odometry and depth estimation
TLDR
The proposed Un-VDNet, which uses unsupervised convolutional neural networks to predict camera ego-motion and depth maps from image sequences, outperforms state-of-the-art methods for visual odometry and depth estimation when dealing with dynamic objects in outdoor and indoor scenes.
Visual odometry based on convolutional neural networks for large-scale scenes
TLDR
This work trains a novel framework based on convolutional neural networks (CNNs), named MD-Net, which extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations in large, even texture-less, scenes.
Epipolar Geometry based Learning of Multi-view Depth and Ego-Motion from Monocular Sequences
TLDR
A two-view depth network is proposed that infers scene depth from consecutive frames, thereby learning inter-pixel relationships; it yields better depth images and pose estimates that better capture the scene structure and motion.

References

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
TLDR
This paper employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally; it also applies a scale-invariant error to measure depth relations rather than scale.
Deep Stereo: Learning to Predict New Views from the World's Imagery
TLDR
This work presents a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets, and is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation
  • N. Mayer, Eddy Ilg, T. Brox
  • Computer Science
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2016
TLDR
This paper proposes three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks and presents a convolutional network for real-time disparity estimation that provides state-of-the-art results.
Learning Depth from Single Monocular Images
TLDR
This work begins by collecting a training set of monocular images (of unstructured outdoor environments that include forests, trees, buildings, etc.) and their corresponding ground-truth depth maps, and then applies supervised learning to predict the depth map as a function of the image.
Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields
TLDR
A deep convolutional neural field model for estimating depth from single monocular images is presented, aiming to jointly explore the capacity of deep CNNs and continuous CRFs, and a deep structured learning scheme is proposed that learns the unary and pairwise potentials of the continuous CRF in a unified deep CNN framework.
Modelling uncertainty in deep learning for camera relocalization
  • Alex Kendall, R. Cipolla
  • Computer Science
    2016 IEEE International Conference on Robotics and Automation (ICRA)
  • 2016
TLDR
A Bayesian convolutional neural network is used to regress the 6-DOF camera pose from a single RGB image, and an estimate of the model's relocalization uncertainty is obtained, improving state-of-the-art localization accuracy on a large-scale outdoor dataset.
Learning Image Representations Tied to Ego-Motion
TLDR
This work proposes to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks that learn visual representations from egocentric video, enforcing that the learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct ego-motions.
Learning 3-D Scene Structure from a Single Still Image
TLDR
This work considers the problem of estimating detailed 3D structure from a single still image of an unstructured environment and uses a Markov random field (MRF) to infer a set of "plane parameters" that capture both the 3D location and 3D orientation of the patch.
FlowNet: Learning Optical Flow with Convolutional Networks
TLDR
This paper constructs CNNs which are capable of solving the optical flow estimation problem as a supervised learning task, and proposes and compares two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations.
Computing the stereo matching cost with a convolutional neural network
  • J. Zbontar, Yann LeCun
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
TLDR
This work trains a convolutional neural network to predict how well two image patches match and uses it to compute the stereo matching cost, which achieves an error rate of 2.61% on the KITTI stereo dataset.