Corpus ID: 236447392

MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Pan Ji, Runze Li, Bir Bhanu, Yi Xu
Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart in at least the following two aspects: (i) the depth range of indoor sequences varies widely across frames, making it difficult for the depth network to induce consistent depth cues, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences contain far more rotational motion, which causes difficulties for the…


Unsupervised Depth Learning in Challenging Indoor Video: Weak Rectification to Rescue
This work establishes that the degenerate camera motions exhibited in handheld settings are a critical obstacle for unsupervised depth learning and proposes a novel data pre-processing method for effective training, that is, searching for image pairs with modest translation and removing their rotation via the proposed weak image rectification.
Moving Indoor: Unsupervised Video Depth Learning in Challenging Environments
This work proposes a new optical-flow-based training paradigm that reduces the difficulty of unsupervised learning by providing a clearer training target and handles non-texture regions.
Digging Into Self-Supervised Monocular Depth Estimation
It is shown that a surprisingly simple model, and associated design choices, lead to superior predictions, and together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.
Towards Good Practice for CNN-Based Monocular Depth Estimation
By a careful redesign, a model for depth estimation is presented, which achieves competitive performance on KITTI and state-of-the-art performance on NYU Depth v2.
Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video
This paper proposes a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions, and is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras
This work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale.
Learning Monocular Depth by Distilling Cross-domain Stereo Networks
This paper proposes to use the stereo matching network as a proxy to learn depth from synthetic data and use predicted stereo disparity maps for supervising the monocular depth estimation network.
Unsupervised Learning of Depth and Ego-Motion from Video
Empirical evaluation demonstrates the effectiveness of the unsupervised learning framework: monocular depth estimation performs comparably with supervised methods that use either ground-truth pose or depth for training, and pose estimation performs favorably compared to established SLAM systems under comparable input settings.
Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling
This paper models the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module, and proposes a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first-stage features.
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer
This work proposes a robust training objective that is invariant to changes in depth range and scale, advocates the use of principled multi-objective learning to combine data from different sources, and highlights the importance of pretraining encoders on auxiliary tasks.