Corpus ID: 231802071

Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

  Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified… 
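The inverse-versus-forward distinction highlighted above can be sketched numerically. In inverse (backward) projection, every target pixel computes where to sample the source image: dense and hole-free, but the sampling direction entangles per-object motion. In forward projection, source pixels are pushed into the target view with their own depth, resolving collisions with a z-buffer: geometrically correct for composing per-object rigid motion, but it leaves holes. A minimal NumPy sketch under an assumed pinhole model (function names and the hard z-buffer splat are illustrative assumptions, not the paper's differentiable forward-projection module):

```python
import numpy as np

def backproject(depth, K_inv):
    """Lift each pixel (u, v) with depth d to the 3D point d * K^-1 [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    return depth.reshape(1, -1) * (K_inv @ pix)                        # 3 x N

def inverse_warp_coords(depth_tgt, K, T_tgt_to_src):
    """Inverse (backward) projection: for every *target* pixel, compute where
    to sample the *source* image. Dense, no holes."""
    K_inv = np.linalg.inv(K)
    pts = backproject(depth_tgt, K_inv)                  # 3 x N, target frame
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))]) # homogeneous, 4 x N
    proj = K @ (T_tgt_to_src @ pts_h)[:3]                # project into source
    return (proj[:2] / proj[2:3]).T                      # N x 2 sample coords

def forward_splat(depth_src, K, T_src_to_tgt):
    """Forward projection: push every *source* pixel into the target view,
    keeping the nearest point per target pixel (z-buffer). Correct direction
    for per-object motion, but holes remain where nothing lands."""
    h, w = depth_src.shape
    pts = backproject(depth_src, np.linalg.inv(K))
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    cam = (T_src_to_tgt @ pts_h)[:3]                     # points in target frame
    proj = K @ cam
    uv = np.round(proj[:2] / proj[2:3]).astype(int)      # nearest-pixel splat
    zbuf = np.full((h, w), np.inf)
    for i in range(uv.shape[1]):
        u, v, z = uv[0, i], uv[1, i], cam[2, i]
        if 0 <= u < w and 0 <= v < h and z < zbuf[v, u]:
            zbuf[v, u] = z                               # nearest point wins
    return zbuf
```

With identity motion, the forward splat reproduces the source depth and the inverse warp samples each pixel at itself; with real object motion the two directions diverge, which is the inconsistency the paper's forward module addresses.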


Instance-aware multi-object self-supervision for monocular depth prediction
The proposed self-supervised monocular image-to-depth prediction framework is shown to largely outperform prior methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed.
Unsupervised Scale-consistent Depth Learning from Video
SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and enables scale-consistent prediction at inference time, is proposed, along with a self-discovered mask that automatically localizes moving objects which violate the underlying static-scene assumption and cause noisy signals during training.
Disentangling Object Motion and Occlusion for Unsupervised Multi-frame Monocular Depth
The method, called DynamicDepth, is a new framework trained via a self-supervised cycle-consistent learning scheme to solve the object-motion mismatch problem between frames; it significantly outperforms state-of-the-art monocular depth prediction methods, especially in regions containing dynamic objects.
Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation
This work designs an integrated motion model that estimates camera motion and object motion in the first and second warping stages, respectively, controlled by an attention module through a shared motion encoder.
Auto-Rectify Network for Unsupervised Indoor Depth Estimation
This work establishes that the complex ego-motions exhibited in handheld settings are a critical obstacle for learning depth, and proposes an Auto-Rectify Network with novel loss functions, which can automatically learn to rectify images during training.
Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation
This work improves self-supervised monocular depth estimation by leveraging cross-domain information, especially scene semantics, through two ideas: a metric-learning approach that exploits semantics-guided local geometry to optimize intermediate depth representations, and a feature-fusion module that judiciously utilizes the cross-modality between two heterogeneous feature representations.
Dyna-DM: Dynamic Object-aware Self-supervised Monocular Depth Maps
This paper proposes using only an invariant pose loss for the first few training epochs, disregarding small potentially dynamic objects during training, and employing an appearance-based approach to separately estimate object pose for truly dynamic objects, resulting in qualitatively and quantitatively improved depth maps.
Unsupervised Monocular Depth Estimation in Highly Complex Environments
The problem of unsupervised monocular depth estimation in highly complex scenarios is investigated, and an image-adaptation approach is proposed to evaluate the quality of transferred images and re-weight the corresponding losses, so as to improve the performance of the adapted depth model.
PLNet: Plane and Line Priors for Unsupervised Indoor Depth Estimation
This paper proposes PLNet, which leverages plane and line priors to enhance depth estimation and evaluates the flatness and straightness of the predicted point cloud on reliable planar and linear regions.
Adaptive confidence thresholding for monocular depth estimation
This paper proposes a new approach that leverages pseudo ground-truth depth maps of stereo images generated by self-supervised stereo matching methods, together with a probabilistic framework that refines the monocular depth map, guided by its uncertainty map, through a pixel-adaptive convolution (PAC) layer.


Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving-object segmentation and scene flow estimation shows that the approach outperforms other state-of-the-art methods, demonstrating the effectiveness of each module of the proposed method.
Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video
This paper proposes a geometry-consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions; it is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
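The geometry-consistency idea in this line of work can be sketched as a normalized difference between the depth map warped from the source view and the depth predicted in the target view, with the self-discovered mask derived from the same quantity. A minimal sketch (variable names are assumptions, not the paper's exact implementation):

```python
import numpy as np

def geometry_consistency(d_warped, d_interp, eps=1e-7):
    """Per-pixel depth inconsistency between the depth warped from the source
    view (d_warped) and the target-view prediction interpolated at the
    projected locations (d_interp). Normalized to [0, 1] and symmetric in
    its two inputs; the mask down-weights pixels where depths disagree,
    which typically indicates moving objects or occlusion."""
    diff = np.abs(d_warped - d_interp) / (d_warped + d_interp + eps)
    mask = 1.0 - diff  # self-discovered mask: near 1 where depths agree
    return diff.mean(), mask
```

Consistent depths give a loss near zero and a mask near one; a pixel whose warped depth is three times the predicted depth gets a mask weight of only 0.5.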
Self-Supervised Monocular Scene Flow Estimation
  Junhwa Hur and S. Roth. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
This work designs a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume, and adopts self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data.
Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera
We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video, addressing the difficulty of acquiring realistic…
Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
The main contribution is to explicitly consider the inferred 3D geometry of the whole scene and enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames; the method outperforms the state-of-the-art on both depth and ego-motion estimation.
Digging Into Self-Supervised Monocular Depth Estimation
It is shown that a surprisingly simple model, and associated design choices, lead to superior predictions, and together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.
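One of the simple design choices commonly associated with this work is the per-pixel minimum reprojection loss: instead of averaging photometric errors over source views, each pixel keeps only its best-matching source, so occlusion in one view does not corrupt the supervision signal. A minimal sketch (array names and shapes are assumptions):

```python
import numpy as np

def min_reprojection_loss(errors):
    """errors: list of per-pixel photometric error maps, one per source view.
    Taking the per-pixel minimum lets a pixel that is occluded in one source
    view still be supervised by another, instead of averaging in the outlier."""
    return np.minimum.reduce(errors).mean()
```

For two views with errors [[1, 4]] and [[3, 2]], the per-pixel minimum is [[1, 2]], giving a loss of 1.5 rather than the per-view average of 2.5.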
GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
An adaptive geometric consistency loss is proposed to increase robustness towards outliers and non-Lambertian regions, which resolves occlusions and texture ambiguities effectively and achieves state-of-the-art results in all three tasks, performing better than previous unsupervised methods and comparably with supervised ones.
Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras
This work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale.
Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
A new self-supervised semantically guided depth estimation (SGDepth) method is proposed to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumption typically made during training of such models.
Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation
The proposed SceneNet model is able to perform region-aware depth estimation by enforcing semantic consistency between stereo pairs, and produces favorable results against state-of-the-art approaches.