Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World

@article{zhang2022towards,
  title={Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World},
  author={Sen Zhang and Jing Zhang and Dacheng Tao},
  journal={2022 International Conference on Robotics and Automation (ICRA)},
  year={2022}
}
  • Published 11 March 2022
  • Computer Science
Monocular visual odometry (VO) has attracted extensive research attention by providing real-time vehicle motion from cost-effective camera images. However, state-of-the-art optimization-based monocular VO methods suffer from a scale-inconsistency problem in long-term predictions. Deep learning has recently been introduced to address this issue by leveraging stereo sequences or ground-truth motions in the training dataset. However, this comes at the additional cost of data collection, and such…
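The scale-inconsistency problem the abstract refers to can be made concrete: a monocular trajectory is recovered only up to an unknown scale, and if that scale drifts over time, no single global alignment to ground truth can remove the error. A minimal numpy sketch (the function name and the simulated drift are illustrative, not from the paper):

```python
import numpy as np

def align_scale(est, gt):
    """Least-squares scale aligning an up-to-scale trajectory to ground truth.

    est, gt: (N, 3) arrays of camera positions. Returns the scalar s that
    minimizes ||s * est_c - gt_c||^2 after removing the centroids.
    """
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    return float((est_c * gt_c).sum() / (est_c * est_c).sum())

# Simulate scale drift: the estimate uses a different scale in its second
# half, so no single global scale factor can fix the whole trajectory.
t = np.linspace(0.0, 1.0, 100)
gt = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
est = gt.copy()
est[50:, 0] = est[49, 0] + 0.5 * (est[50:, 0] - est[49, 0])  # scale halves mid-way

s = align_scale(est, gt)
residual = np.linalg.norm(
    s * (est - est.mean(0)) - (gt - gt.mean(0)), axis=1
).max()
```

A scale-consistent method keeps the residual after one global alignment small; scale drift, as simulated above, leaves an irreducible alignment error.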

Figures and Tables from this paper

Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular Depth Estimation by Integrating IMU Motion Dynamics

By leveraging IMU during training, DynaDepth not only learns an absolute scale, but also provides better generalization ability and robustness against vision degradation such as illumination changes and moving objects.

Information-Theoretic Odometry Learning

This paper bounds the generalization errors of the deep information-bottleneck framework and the predictability of its stochastic latent representation, providing not only a performance guarantee but also practical guidance for model design, sample collection, and sensor selection.

JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

A novel joint perception framework named JPerceiver is proposed, which can simultaneously estimate scale-aware depth and VO as well as BEV layout from a monocular video sequence based on a carefully-designed scale loss.

Towards Accurate Ground Plane Normal Estimation from Ego-Motion

A novel approach for ground plane normal estimation of wheeled vehicles that fully utilizes the underlying connection between the ego-pose odometry (ego-motion) and the nearby ground plane, achieving state-of-the-art accuracy on the KITTI dataset with an error of 0.39° in the estimated normal vector.

SIR: Self-Supervised Image Rectification via Seeing the Same Scene From Multiple Different Lenses

A novel self-supervised image rectification (SIR) method built on an important insight: the rectified results of distorted images of the same scene taken through different lenses should be identical. It achieves comparable or even better performance than the supervised baseline and representative state-of-the-art (SOTA) methods.

Enhancing Self-Supervised Monocular Depth Estimation with Traditional Visual Odometry

This paper further improves monocular depth estimation by integrating a geometric prior into existing self-supervised networks, and proposes a sparsity-invariant autoencoder able to process the output of conventional visual odometry algorithms working in synergy with depth-from-mono networks.

Generalizing to the Open World: Deep Visual Odometry with Online Adaptation

This paper proposes an online adaptation framework for deep VO with the assistance of scene-agnostic geometric computations and Bayesian inference that enables fast adaptation of deep VO networks to unseen environments in a self-supervised manner.

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

This paper proposes a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions; it is the first work to show that deep networks trained on unlabelled monocular videos can predict globally scale-consistent camera trajectories over long video sequences.
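The geometry consistency loss above can be sketched in simplified form: compare the depth map of one frame, warped into a neighboring view, against the depth predicted for that view, and reuse the normalized difference as a self-discovered mask. The warping step (which needs camera intrinsics and the relative pose) is assumed precomputed here; all names are illustrative:

```python
import numpy as np

def geometry_consistency(d_warped, d_target, eps=1e-7):
    """Normalized depth difference between a warped and a target depth map.

    d_warped: depth of frame A projected into frame B (assumed precomputed).
    Returns (loss, mask), where the mask down-weights pixels that violate
    geometric consistency (moving objects, occlusions).
    """
    diff = np.abs(d_warped - d_target) / (d_warped + d_target + eps)
    return float(diff.mean()), 1.0 - diff

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 10.0, size=(4, 4))
d_warped = depth.copy()
d_warped[0, 0] *= 3.0   # simulate a moving object at one pixel

loss, mask = geometry_consistency(d_warped, depth)
# The inconsistent pixel gets a low mask weight; static pixels stay near 1.
```

Penalizing this normalized difference ties the depth scales of neighboring frames together, which is what propagates a single consistent scale along a long sequence.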

Visual Odometry Revisited: What Should Be Learnt?

This work revisits the basics of VO, explores the right way to integrate deep learning with epipolar geometry and the Perspective-n-Point method, and designs a simple but robust frame-to-frame VO algorithm (DF-VO) that outperforms pure deep learning-based and geometry-based methods.

Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge

This paper proposes monoResMatch, a novel deep architecture designed to infer depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, and performing stereo matching between the two cues; it also shows how obtaining proxy ground-truth annotations through traditional stereo algorithms enables more accurate monocular depth estimation.

Towards Better Generalization: Joint Depth-Pose Learning Without PoseNet

A novel system that explicitly disentangles scale from the network estimation, achieving state-of-the-art results among self-supervised learning-based methods on the KITTI Odometry and NYUv2 datasets, and presenting some interesting findings on the limited generalization ability of PoseNet-based relative pose estimation methods.
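The scale-disentanglement idea can be illustrated as follows: the pose network predicts only a unit-norm translation direction, and the relative scale is recovered afterwards by robustly aligning depth triangulated from two-view geometry with the network's depth prediction. A toy numpy sketch under these assumptions (names hypothetical, not the paper's API):

```python
import numpy as np

def recover_scale(d_network, d_triangulated):
    """Align an up-to-scale triangulated depth to the network's depth.

    Returns the robust (median-ratio) scale factor to apply to the
    unit-norm translation so that both depth sources agree.
    """
    return float(np.median(d_network / d_triangulated))

rng = np.random.default_rng(1)
d_net = rng.uniform(2.0, 20.0, size=100)
true_scale = 3.7
d_tri = d_net / true_scale         # triangulated depth at unit baseline
d_tri[:5] *= 10.0                  # a few outlier correspondences

t_unit = np.array([0.0, 0.0, 1.0])  # network outputs direction only
t_metric = recover_scale(d_net, d_tri) * t_unit
```

Using the median rather than the mean keeps the recovered scale stable even when some triangulated points come from bad correspondences, which is why the system generalizes better than a PoseNet that must regress scale directly.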

Self-Supervised Deep Visual Odometry With Online Adaptation

An online meta-learning algorithm is proposed to enable VO networks to continuously adapt to new environments in a self-supervised manner, utilizing convolutional long short-term memory (convLSTM) to aggregate rich spatio-temporal information from past frames.

Deep Online Correction for Monocular Visual Odometry

Even without complex back-end optimization modules, the proposed deep online correction framework achieves outstanding performance with a relative transform error (RTE) of 2.0% on the KITTI Odometry benchmark for Seq.

Real-Time Monocular Depth Estimation Using Synthetic Data with Domain Adaptation via Image Style Transfer

This work takes advantage of style transfer and adversarial training to predict pixel perfect depth from a single real-world color image based on training over a large corpus of synthetic environment data.

Digging Into Self-Supervised Monocular Depth Estimation

It is shown that a surprisingly simple model, and associated design choices, lead to superior predictions, and together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.