Corpus ID: 231802071

Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

@inproceedings{Lee2021LearningMD,
  title={Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency},
  author={Seokju Lee and Sunghoon Im and Stephen Lin and In So Kweon},
  booktitle={AAAI},
  year={2021}
}
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified… 
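The abstract contrasts inverse and forward projection when warping pixels between views under estimated depth, ego-motion, and per-object rigid motion. As a minimal illustrative sketch (not the authors' pipeline; the function name `reproject` and the sample intrinsics are invented here), both projection directions share the same geometric core: backproject a pixel through its depth, apply a 6-DoF rigid motion, and reproject with the camera intrinsics. Inverse warping then *samples* the source image at the projected location, while forward warping *splats* the pixel there instead.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Map pixel (u, v) with given depth through a rigid motion (R, t)."""
    # Backproject the pixel to a 3D point in camera coordinates.
    p = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Apply the rigid 6-DoF motion (ego-motion or an object's motion).
    p = R @ p + t
    # Reproject into the other view and dehomogenize.
    q = K @ p
    return q[0] / q[2], q[1] / q[2]

# Hypothetical pinhole intrinsics, for illustration only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Sanity check: with identity motion, a pixel maps back to itself.
u2, v2 = reproject(100.0, 80.0, 5.0, K, np.eye(3), np.zeros(3))
```

The geometrically delicate part highlighted by the paper is what happens after this mapping: sampling at `(u2, v2)` is differentiable and easy (inverse projection), whereas splatting to `(u2, v2)` requires handling collisions and holes, which motivates a dedicated neural forward-projection module.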

Citations

Instance-aware multi-object self-supervision for monocular depth prediction
The proposed self-supervised monocular image-to-depth prediction framework is shown to largely outperform prior methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed.

Unsupervised Scale-consistent Depth Learning from Video
Proposes SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and yields scale-consistent predictions at inference time, together with a self-discovered mask that automatically localizes moving objects which violate the underlying static-scene assumption and cause noisy training signals.

Disentangling Object Motion and Occlusion for Unsupervised Multi-frame Monocular Depth
The method, called DynamicDepth, is a new framework trained via a self-supervised cycle-consistent learning scheme to solve the mismatch problem; it significantly outperforms state-of-the-art monocular depth prediction methods, especially in regions containing dynamic objects.

Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation
Designs an integrated motion model that estimates the motion of the camera and of objects in the first and second warping stages, respectively, controlled by an attention module through a shared motion encoder.

Auto-Rectify Network for Unsupervised Indoor Depth Estimation
Establishes that the complex ego-motions exhibited in handheld settings are a critical obstacle for learning depth, and proposes an Auto-Rectify Network with novel loss functions that can automatically learn to rectify images during training.

Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation
Improves self-supervised monocular depth estimation by leveraging cross-domain information, especially scene semantics, through two ideas: a metric-learning approach that exploits semantics-guided local geometry to optimize intermediate depth representations, and a novel feature-fusion module that judiciously exploits the cross-modality between two heterogeneous feature representations.

Unsupervised Monocular Depth Estimation in Highly Complex Environments
Addresses unsupervised monocular depth estimation in highly complex scenarios via domain adaptation, proposing a unified image-transfer-based adaptation framework built on monocular videos.

PLNet: Plane and Line Priors for Unsupervised Indoor Depth Estimation
Proposes PLNet, which leverages plane and line priors to enhance depth estimation and evaluates the flatness and straightness of the predicted point cloud on reliable planar and linear regions.

Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation
Constructs Guided Cost volume Excitation (GCE), showing that simple channel excitation of the cost volume guided by the image considerably improves performance, and proposes a novel top-k selection prior to soft-argmin disparity regression for computing the final disparity estimate.

Semi-Supervised Learning with Mutual Distillation for Monocular Depth Estimation
Proposes a semi-supervised learning framework for monocular depth estimation that builds a separate network branch for each loss and distills each branch into the other through a mutual distillation loss, achieving the complementary advantages of both loss functions.

References

Showing 1–10 of 55 references
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving-object segmentation, and scene flow estimation shows that the approach outperforms other state-of-the-art methods, demonstrating the effectiveness of each module of the proposed method.

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video
Proposes a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions; this is the first work to show that deep networks trained on unlabelled monocular videos can predict globally scale-consistent camera trajectories over long video sequences.
Self-Supervised Monocular Scene Flow Estimation
Junhwa Hur and S. Roth. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Designs a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume, adopting self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data.

Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera
Presents GLNet, a self-supervised framework for learning depth, optical flow, camera pose, and intrinsic parameters from monocular video, addressing the difficulty of acquiring realistic …
Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
The main contribution is to explicitly consider the inferred 3D geometry of the whole scene and to enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames; the method outperforms the state of the art.

Digging Into Self-Supervised Monocular Depth Estimation
Shows that a surprisingly simple model and associated design choices lead to superior predictions, together yielding quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
An adaptive geometric consistency loss is proposed to increase robustness to outliers and non-Lambertian regions, resolving occlusions and texture ambiguities effectively; the method achieves state-of-the-art results on all three tasks, performing better than previous unsupervised methods and comparably with supervised ones.

Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras
This work is the first to learn camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing accurate depth and motion to be extracted from arbitrary videos of unknown origin at scale.

Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
A new self-supervised semantically guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumption typically made during training of such models.

Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation
The proposed SceneNet model performs region-aware depth estimation by enforcing semantic consistency between stereo pairs and produces favorable results against state-of-the-art approaches.