Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, Marc Pollefeys. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric… 
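As a rough illustration of the featuremetric idea, the toy sketch below refines a detected keypoint against a reference view by minimizing the squared difference of bilinearly interpolated dense features. This is a minimal sketch under strong assumptions: the two-channel synthetic "feature maps", the function names, and the plain gradient step (standing in for a proper second-order optimizer) are all hypothetical, not the paper's actual pipeline.

```python
import numpy as np

def bilinear(fmap, x, y):
    """Bilinearly sample an (H, W, C) feature map at a continuous (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dy) * (1 - dx) * fmap[y0, x0]
            + (1 - dy) * dx * fmap[y0, x0 + 1]
            + dy * (1 - dx) * fmap[y0 + 1, x0]
            + dy * dx * fmap[y0 + 1, x0 + 1])

def refine_keypoint(f_ref, ref_xy, f_tgt, init_xy, steps=400, lr=1.0):
    """Shift a tentative keypoint in the target view so that its feature
    matches the reference feature (sum-of-squares featuremetric error)."""
    target = bilinear(f_ref, *ref_xy)           # fixed reference descriptor
    x, y = init_xy
    eps = 1e-3                                   # finite-difference step
    for _ in range(steps):
        r = bilinear(f_tgt, x, y) - target       # featuremetric residual
        gx = (bilinear(f_tgt, x + eps, y) - bilinear(f_tgt, x - eps, y)) / (2 * eps)
        gy = (bilinear(f_tgt, x, y + eps) - bilinear(f_tgt, x, y - eps)) / (2 * eps)
        x -= lr * 2 * (r @ gx)                   # gradient of ||r||^2
        y -= lr * 2 * (r @ gy)
    return x, y

# Two synthetic 2-channel "feature maps" of the same scene; the target view
# is shifted by a sub-pixel offset of (+0.7, -0.4) pixels.
H = W = 32
ys, xs = np.mgrid[0:H, 0:W].astype(float)
feat = lambda x, y: np.stack([np.sin(0.3 * x + 0.1 * y),
                              np.sin(0.1 * x + 0.45 * y)], axis=-1)
f_ref, f_tgt = feat(xs, ys), feat(xs - 0.7, ys + 0.4)

# A detector quantizes the target keypoint to (17, 16); featuremetric
# refinement recovers the sub-pixel location (16.7, 15.6).
x, y = refine_keypoint(f_ref, (16.0, 16.0), f_tgt, (17.0, 16.0))
```

Note how the refined location is driven purely by low-level feature agreement, before any geometric estimation happens.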
TC-SfM: Robust Track-Community-Based Structure-from-Motion
A novel structure, the track-community, is proposed, in which each community consists of a group of tracks and represents a local segment of the scene; this lets the approach robustly alleviate reconstruction failures caused by visually indistinguishable structures and accurately merge partial reconstructions.
DFNet: Enhance Absolute Pose Regression with Direct Feature Matching
It is shown that domain-invariant feature matching effectively enhances camera pose estimation in both indoor and outdoor scenes, achieving state-of-the-art accuracy by outperforming existing single-image APR methods by as much as 56%, comparable to 3D structure-based methods.
Exploiting Correspondences with All-pairs Correlations for Multi-view Depth Estimation
A novel iterative multi-view depth estimation framework mimicking the optimization process, which consists of a correlation volume construction module that models the pixel similarity between a reference image and source images as all-to-all correlations, and a novel correlation-guided depth refinement module that reprojects points in different views to effectively fetch relevant correlations.
Improving Worst Case Visual Localization Coverage via Place-specific Sub-selection in Multi-camera Systems
This work investigates the utility of place-specific configurations, where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system.
ScaleNet: A Shallow Architecture for Scale Estimation
This paper designs a new architecture, ScaleNet, that exploits dilated convolutions as well as self- and cross-correlation layers to predict the scale between images, and demonstrates that rectifying images with estimated scales leads to performance improvements for various tasks and methods.
MatchFormer: Interleaving Attention in Transformers for Feature Matching
This work proposes a novel hierarchical extract-and-match transformer, termed MatchFormer, which combines self- and cross-attention on multi-scale features in a hierarchical architecture and improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data.
A Quadrifocal Tensor SFM Photogrammetry Positioning and Calibration Technique for HOFS Aerial Sensors
Nowadays, the integration between photogrammetry and structure from motion (SFM) has become much closer, and many attempts have been made to combine the two approaches to realize the positioning…
Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs
This work begins by analyzing visibility statistics for large-scale scenes, motivating a sparse network structure where parameters are specialized to different regions of the scene, and achieves a 40x speedup over conventional NeRF rendering while remaining within 0.8 dB in PSNR, exceeding the quality of existing fast renderers.
Efficient Linear Attention for Fast and Accurate Keypoint Matching
This work employs an efficient linear attention for the linear computational complexity of Transformers, and proposes a new attentional aggregation that achieves high accuracy by aggregating both the global and local information from sparse keypoints.
Multi-View Optimization of Local Feature Geometry
This work addresses the problem of refining the geometry of local image features from multiple views without known scene or camera geometry by first estimating local geometric transformations between tentative matches and then optimizing the keypoint locations jointly over multiple views according to a non-linear least squares formulation.
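A heavily simplified analog of such joint multi-view refinement can be written as a linear least-squares problem, assuming the local transformations have already been reduced to pairwise translation estimates between views (the paper's formulation is non-linear and richer); `refine_track` and its arguments are hypothetical names:

```python
import numpy as np

def refine_track(n, edges, anchor):
    """Jointly refine n 2-D keypoint locations from pairwise offset estimates.

    edges: list of (i, j, t_ij) with t_ij ~ p_j - p_i, e.g. derived from
    local geometric transformations between tentative matches.
    anchor: (index, position) pinning one keypoint to fix the global shift.
    """
    A = np.zeros((2 * len(edges) + 2, 2 * n))
    b = np.zeros(2 * len(edges) + 2)
    for k, (i, j, t) in enumerate(edges):
        for d in range(2):                 # one equation per coordinate
            A[2 * k + d, 2 * j + d] = 1.0
            A[2 * k + d, 2 * i + d] = -1.0
            b[2 * k + d] = t[d]
    ai, (ax, ay) = anchor                  # gauge-fixing rows
    A[-2, 2 * ai], b[-2] = 1.0, ax
    A[-1, 2 * ai + 1], b[-1] = 1.0, ay
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p.reshape(n, 2)

# A 3-view track: offsets measured between every view pair.
edges = [(0, 1, (2.0, 1.0)), (1, 2, (-3.0, 2.0)), (0, 2, (-1.0, 3.0))]
P = refine_track(3, edges, anchor=(0, (10.0, 10.0)))
```

With consistent measurements the solve recovers the track exactly; with noisy ones it averages the evidence across all view pairs, which is the point of optimizing the whole track jointly.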
Back to the Feature: Learning Robust Camera Localization from Pixels to Pose
PixLoc is introduced, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model, based on the direct alignment of multiscale deep features, casting camera localization as metric learning.
Patch2Pix: Epipolar-Guided Pixel-Level Correspondences
This work presents Patch2Pix, a novel refinement network that refines match proposals by regressing pixel-level matches from the local regions defined by those proposals and jointly rejecting outlier matches with confidence scores.
Deep Probabilistic Feature-Metric Tracking
A new framework is proposed that learns a pixel-wise deep feature map and a deep feature-metric uncertainty map predicted by a Convolutional Neural Network, which together formulate a deep probabilistic feature-metric residual of the two-view constraint that can be minimised using Gauss-Newton in a coarse-to-fine optimisation framework.
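The uncertainty-weighted Gauss-Newton machinery can be sketched generically. The toy below minimizes a weighted sum of squared residuals on a range-based localization problem rather than the paper's learned feature-metric residual; the weights play the role of inverse variances, and all names are illustrative:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, weights, iters=25):
    """Minimise sum_i w_i * r_i(x)^2 with Gauss-Newton; weights act as
    inverse variances, down-weighting residuals deemed unreliable."""
    x = np.asarray(x0, dtype=float)
    W = np.diag(weights)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        # Normal equations of the weighted linearized problem.
        x = x - np.linalg.solve(J.T @ W @ J, J.T @ W @ r)
    return x

# Toy problem: locate a 2-D point from range measurements to three beacons,
# with one noisier (down-weighted) measurement.
beacons = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
truth = np.array([2.0, 3.0])
dists = np.linalg.norm(truth - beacons, axis=1)

residual = lambda x: np.linalg.norm(x - beacons, axis=1) - dists
jacobian = lambda x: (x - beacons) / np.linalg.norm(x - beacons, axis=1)[:, None]
x = gauss_newton(residual, jacobian, x0=[1.0, 1.0], weights=[1.0, 1.0, 0.25])
```

In the paper's setting the residual would instead compare deep features between two views, and a coarse-to-fine schedule over feature pyramids would wrap this inner loop.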
Image Matching Across Wide Baselines: From Paper to Practice
It is shown that with proper settings, classical solutions may still outperform the perceived state of the art, and the conducted experiments reveal unexpected properties of structure from motion pipelines that can help improve their performance, for both algorithmic and learned methods.
LSD-SLAM: Large-Scale Direct Monocular SLAM
A novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and an elegant probabilistic solution to include the effect of noisy depth values into tracking are introduced.
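A minimal sketch of why working on Sim(3) exposes scale drift, assuming nothing beyond the matrix form of a similarity transform (function names are hypothetical): the determinant of the upper-left 3x3 block recovers the accumulated scale along a chain of keyframe transforms.

```python
import numpy as np

def sim3(s, R, t):
    """Build a 4x4 similarity transform acting as x -> s * R @ x + t."""
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T

def scale_of(T):
    """Recover the scale of a Sim(3) matrix: det(s * R) = s^3."""
    return np.linalg.det(T[:3, :3]) ** (1.0 / 3.0)

# Two keyframe-to-keyframe transforms, each with slight scale drift.
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
T1 = sim3(1.05, Rz, np.array([0.1, 0.0, 0.0]))
T2 = sim3(0.97, Rz, np.array([0.0, 0.2, 0.0]))
drift = scale_of(T1 @ T2)   # accumulated scale along the chain
```

A pure SE(3) formulation would force the scale factor to 1 and leave this drift invisible to the tracker.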
Semantic Texture for Robust Dense Tracking
We argue that robust dense SLAM systems can make valuable use of the layers of features coming from a standard CNN as a pyramid of 'semantic texture' which is suitable for dense alignment while being…
RANSAC-Flow: generic two-stage image alignment
This paper considers the generic problem of dense alignment between two images and proposes a two-stage process: first, a feature-based parametric coarse alignment using one or more homographies, followed by non-parametric fine pixel-wise alignment.
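The coarse stage of such a two-stage scheme can be sketched as a plain RANSAC homography fit on synthetic correspondences. This is a generic textbook DLT-plus-RANSAC, not RANSAC-Flow's actual feature-based procedure, and the names are illustrative:

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: homography from >= 4 correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)        # null-space vector of the constraints
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to an (N, 2) array of points."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:]

def ransac_homography(src, dst, iters=500, thresh=2.0, seed=0):
    """Coarse alignment: best homography under a reprojection threshold."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        sample = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    return fit_homography(src[best], dst[best]), best

# Synthetic matches: 32 follow a ground-truth homography, 8 are outliers.
H_true = np.array([[1.0, 0.02, 5.0], [-0.01, 1.0, -3.0], [1e-4, 0.0, 1.0]])
rng = np.random.default_rng(42)
src = rng.uniform(0, 100, (40, 2))
dst = apply_h(H_true, src)
dst[:8] += rng.uniform(20, 50, (8, 2))   # gross mismatches
H_est, inliers = ransac_homography(src, dst)
```

The fine stage would then predict a non-parametric per-pixel flow on top of this coarse warp.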
Photometric Bundle Adjustment for Vision-Based SLAM
The proposed algorithm maximizes photometric consistency and estimates the correspondences implicitly; it is shown to improve upon the accuracy of state-of-the-art VSLAM methods that minimize the reprojection error using traditional BA as well as loop closure.
Direct Sparse Odometry
The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.