End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization

@article{Chen2020EndtoEndLG,
  title={End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization},
  author={Bo Chen and {\'A}lvaro Parra and Jiewei Cao and Nan Li and Tat-Jun Chin},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={8097--8106}
}
  • Bo Chen, Álvaro Parra, Tat-Jun Chin
  • Published 13 September 2019
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Deep networks excel in learning patterns from large amounts of data. On the other hand, many geometric vision tasks are specified as optimization problems. To seamlessly combine deep learning and geometric vision, it is vital to perform learning and geometric optimization end-to-end. Towards this aim, we present BPnP, a novel network module that backpropagates gradients through a Perspective-n-Points (PnP) solver to guide parameter updates of a neural network. Based on implicit differentiation… 
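The abstract's reference to implicit differentiation can be illustrated with a toy sketch. The code below is not the BPnP module and does not solve a PnP problem. It assumes a hypothetical scalar residual `y**3 + y - x` standing in for a reprojection error, and all function names are made up for illustration. It shows the general idea: if the solver output y*(x) satisfies a stationarity condition g(x, y*) = 0, the implicit function theorem gives dy*/dx = -(dg/dy)^{-1} (dg/dx), so a gradient is available without differentiating through the solver's iterations.

```python
def residual(x, y):
    # Hypothetical toy residual; a stand-in for a geometric error term.
    return y**3 + y - x


def stationarity(x, y):
    # g(x, y) = d/dy [0.5 * residual(x, y)**2]; g = 0 at the optimum y*(x).
    return residual(x, y) * (3.0 * y**2 + 1.0)


def solve(x, iters=50):
    # Forward pass: Newton's method on residual(x, y) = 0, treated as a black box.
    y = 0.0
    for _ in range(iters):
        y -= residual(x, y) / (3.0 * y**2 + 1.0)
    return y


def implicit_grad(x, y_star, eps=1e-6):
    # Backward pass via the implicit function theorem:
    # dy*/dx = -(dg/dy)^{-1} * (dg/dx), partials by central differences.
    dg_dy = (stationarity(x, y_star + eps) - stationarity(x, y_star - eps)) / (2 * eps)
    dg_dx = (stationarity(x + eps, y_star) - stationarity(x - eps, y_star)) / (2 * eps)
    return -dg_dx / dg_dy


x0 = 2.0
y_star = solve(x0)                      # y**3 + y = 2 has the root y = 1
grad_implicit = implicit_grad(x0, y_star)

# Sanity check: finite differences through the whole solver.
h = 1e-5
grad_numeric = (solve(x0 + h) - solve(x0 - h)) / (2 * h)
print(y_star, grad_implicit, grad_numeric)
```

For this toy problem the exact derivative is 1 / (3 y*^2 + 1), which at y* = 1 is 0.25, so both printed gradients should agree with that value up to numerical error. In a real pipeline the scalar partials would be replaced by Jacobian blocks of the solver's objective with respect to the pose and the inputs.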

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

TLDR
PixLoc is introduced, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model, based on the direct alignment of multiscale deep features, casting camera localization as metric learning.

Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization

TLDR
This work proposes the first fully end-to-end trainable network for solving the blind PnP problem efficiently and globally, that is, without the need for pose priors, and makes use of recent results in differentiating optimization problems to incorporate geometric model fitting into an end-to-end learning framework.

[Re] On end-to-end 6DoF object pose estimation and robustness to object scale

Further, our results indicate that indeed HigherHRNet improves keypoint localisation performance on small scale objects.

Detecting Object Surface Keypoints From a Single RGB Image via Deep Learning Network for 6-DoF Pose Estimation

TLDR
Techniques for defining 3D object surface keypoints and predicting their corresponding 2D counterparts via deep-learning network architectures are presented, and experimental results show that the proposed technique outperforms state-of-the-art approaches in both “2D projection” and “3D transformation” metrics.

MonoRUn: Monocular 3D Object Detection by Self-Supervised Reconstruction and Uncertainty Propagation

TLDR
MonoRUn is a novel detection framework that learns dense correspondences and geometry in a self-supervised manner, with simple 3D bounding box annotations, and outperforms current state-of-the-art methods on KITTI benchmark.

MonoRUn: Monocular 3D Object Detection by Reconstruction and Uncertainty Propagation

TLDR
MonoRUn is a novel detection framework that learns dense correspondences and geometry in a self-supervised manner, with simple 3D bounding box annotations, and outperforms current state-of-the-art methods on KITTI benchmark.

Exploiting Problem Structure in Deep Declarative Networks: Two Case Studies

TLDR
This work studies two applications of deep declarative networks—robust vector pooling and optimal transport—and shows how problem structure can be exploited to obtain very efficient backward pass computations in terms of both time and memory.

SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose Estimation

TLDR
This work addresses the challenge of directly regressing all 6 degrees-of-freedom for the object pose in a cluttered environment from a single RGB image by means of a novel reasoning about self-occlusion, in order to establish a two-layer representation for 3D objects which considerably enhances the accuracy of end-to-end 6D pose estimation.

To The Point: Correspondence-driven monocular 3D category reconstruction

TLDR
To The Point (TTP), a method for reconstructing 3D objects from a single image using 2D to 3D correspondences learned from weak supervision, uses a simple per-sample optimization problem to replace CNN-based regression of camera pose and non-rigid deformation and thereby obtain substantially more accurate 3D reconstructions.

RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering

TLDR
RePOSE leverages image rendering for fast feature extraction using a 3D model with a learnable texture and utilizes differentiable Levenberg-Marquardt (LM) optimization to refine a pose fast and accurately by minimizing the distance between the input and rendered image representations without the need of zooming in.

References

End-to-End Learning of Geometry and Context for Deep Stereo Regression

We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem’s geometry to form a cost volume using deep feature

DSAC — Differentiable RANSAC for Camera Localization

TLDR
DSAC is applied to the problem of camera localization, where deep learning has so far failed to improve on traditional approaches, and it is demonstrated that by directly minimizing the expected loss of the output camera poses, robustly estimated by RANSAC, it achieves an increase in accuracy.

Geometric Loss Functions for Camera Pose Regression with Deep Learning

  • Alex Kendall, R. Cipolla
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
TLDR
A number of novel loss functions for learning camera pose which are based on geometry and scene reprojection error are explored, and it is shown how to automatically learn an optimal weighting to simultaneously regress position and orientation.

Deeper Depth Prediction with Fully Convolutional Residual Networks

TLDR
A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.

PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

TLDR
This work trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation, demonstrating that convnets can be used to solve complicated out of image plane regression problems.

Modelling uncertainty in deep learning for camera relocalization

  • Alex Kendall, R. Cipolla
  • Computer Science
    2016 IEEE International Conference on Robotics and Automation (ICRA)
  • 2016
TLDR
A Bayesian convolutional neural network is used to regress the 6-DOF camera pose from a single RGB image and an estimate of the model's relocalization uncertainty is obtained to improve state of the art localization accuracy on a large scale outdoor dataset.

gvnn: Neural Network Library for Geometric Computer Vision

TLDR
gvnn, a neural network library in Torch aimed at bridging the gap between classic geometric computer vision and deep learning, is introduced, and several new layers that are often used as parametric transformations on the data in geometric computer vision are proposed.

Numerical Coordinate Regression with Convolutional Neural Networks

TLDR
The differentiable spatial to numerical transform (DSNT) is proposed, which adds no trainable parameters, is fully differentiable, and exhibits good spatial generalization and offers a better trade-off between inference speed and prediction accuracy compared to existing techniques.

Understanding the Limitations of CNN-Based Absolute Camera Pose Regression

TLDR
A theoretical model for camera pose regression is developed that is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure, and shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

TLDR
This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors, and demonstrates that this framework outperforms a fully supervised baseline using the same neural network architecture on the task of pose estimation.
...