NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video

@inproceedings{sun2021neuralrecon,
  title={NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video},
  author={Jiaming Sun and Yiming Xie and Linghao Chen and Xiaowei Zhou and Hujun Bao},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
  • Published 1 April 2021
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments… 
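The fragment-wise fusion described in the abstract can be illustrated with a minimal NumPy sketch: per-voxel features from the current video fragment are fused with a hidden state carried over from previous fragments by a GRU cell, and the fused state is decoded to a TSDF value. All sizes, weights, and the linear decoding head below are illustrative stand-ins, not the paper's network.

```python
# Minimal sketch of GRU-based TSDF fusion across video fragments.
# Random weights stand in for a trained network; shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
F = 8  # feature channels per voxel (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VoxelGRUCell:
    """A plain GRU cell applied independently to each voxel's feature vector."""
    def __init__(self, dim):
        s = 0.1
        self.Wz = rng.normal(0, s, (dim, 2 * dim))  # update gate
        self.Wr = rng.normal(0, s, (dim, 2 * dim))  # reset gate
        self.Wh = rng.normal(0, s, (dim, 2 * dim))  # candidate state

    def step(self, x, h):
        xz = np.concatenate([x, h], axis=-1)
        z = sigmoid(xz @ self.Wz.T)
        r = sigmoid(xz @ self.Wr.T)
        xh = np.concatenate([x, r * h], axis=-1)
        h_tilde = np.tanh(xh @ self.Wh.T)
        return (1 - z) * h + z * h_tilde

def decode_tsdf(h, w):
    # linear head mapping fused per-voxel features to a TSDF value in (-1, 1)
    return np.tanh(h @ w)

gru = VoxelGRUCell(F)
head = rng.normal(0, 0.1, (F,))

n_voxels = 16
h = np.zeros((n_voxels, F))                  # hidden state persists across fragments
for fragment in range(3):                    # three video fragments arrive sequentially
    feats = rng.normal(0, 1, (n_voxels, F))  # per-voxel image features (stand-in)
    h = gru.step(feats, h)                   # fuse with history instead of overwriting
tsdf = decode_tsdf(h, head)
print(tsdf.shape)
```

The point of the recurrent update is that each fragment refines, rather than replaces, the accumulated per-voxel state, which is what yields a globally coherent surface.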

PlanarRecon: Realtime 3D Plane Detection and Reconstruction from Posed Monocular Videos

PlanarRecon is a novel framework for globally coherent detection and reconstruction of 3D planes from a posed monocular video that achieves state-of-the-art performance on the ScanNet dataset while running in real time.

VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction

This paper advocates that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of 3D reconstruction results.

3DVNet: Multi-View Depth Prediction and Volumetric Refinement

Experimental results show the 3DVNet method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics on the ScanNet dataset, as well as on a selection of scenes from the TUM-RGBD and ICL-NUIM datasets, showing that the method is effective and generalizes to new settings.

SimpleRecon: 3D Reconstruction Without 3D Convolutions

This work proposes a simple state-of-the-art multi-view depth estimator with two main contributions: a carefully-designed 2D CNN which utilizes strong image priors alongside a plane-sweep feature volume and geometric losses, combined with the integration of keyframe and geometric metadata into the cost volume which allows informed depth plane scoring.
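The plane-sweep scoring that SimpleRecon's cost volume is built on can be sketched in a few lines: every pixel receives a matching score at each candidate depth plane, and a depth estimate is read out with a soft argmax over planes. The cost volume below is a synthetic stand-in peaked near a known depth, not the paper's learned matching features.

```python
# Hedged sketch of plane-sweep depth scoring with a soft-argmax readout.
# The cost volume is synthetic; in a real system it comes from warped features.
import numpy as np

rng = np.random.default_rng(1)
H, W, D = 4, 5, 32                        # image size and number of depth planes
depth_planes = np.linspace(0.5, 5.0, D)   # candidate depth hypotheses (metres)

# synthetic ground-truth depth and a cost volume that peaks near it
gt = rng.uniform(1.0, 4.0, (H, W))
score = -((depth_planes[None, None, :] - gt[..., None]) ** 2)  # higher = better match
score += rng.normal(0, 0.05, score.shape)                      # matching noise

# soft argmax over the plane dimension turns per-plane scores into a depth map
weights = np.exp(score - score.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
depth = (weights * depth_planes).sum(axis=-1)

print(depth.shape, float(np.abs(depth - gt).mean()))
```

The soft argmax keeps the readout differentiable, which is why cost-volume methods can train the plane scoring end to end.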

Neural 3D Scene Reconstruction with the Manhattan-world Assumption

This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images by showing that the planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods and designs a novel loss that jointly optimizes the scene geometry and semantics in 3D space.

Neural 3D Reconstruction in the Wild

This work introduces a new method that enables efficient and accurate surface reconstruction from Internet photo collections in the presence of varying illumination and proposes a hybrid voxel- and surface-guided sampling technique that allows for more efficient ray sampling around surfaces and leads to significant improvements in reconstruction quality.

TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

This work introduces TransformerFusion, a transformer-based 3D scene reconstruction approach that results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.

Differentiable Gradient Sampling for Learning Implicit 3D Scene Reconstructions from a Single Image

This paper derives a novel closed-form Differentiable Gradient Sampling (DGS) solution that enables backpropagation of the loss on spatial gradients to the feature maps, thus allowing training on large-scale scenes without dense 3D supervision.

HRBF-Fusion: Accurate 3D Reconstruction from RGB-D Data Using On-the-fly Implicits

Reconstruction of high-fidelity 3D objects or scenes is a fundamental research problem. Recent advances in RGB-D fusion have demonstrated the potential of producing 3D models from consumer-level RGB-D sensors.

SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views

This work introduces SparseNeuS, a novel neural-rendering-based method for surface reconstruction from multi-view images that not only outperforms state-of-the-art methods but also exhibits good efficiency, generalizability, and flexibility.

RoutedFusion: Learning Real-Time Depth Map Fusion

This work proposes a neural network that predicts non-linear updates to better account for typical fusion errors and outperforms the traditional fusion approach and related learned approaches on both synthetic and real data.
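For context, the classical fusion that RoutedFusion's learned updates replace is a running weighted average of per-frame TSDF observations at each voxel: a purely linear update that cannot correct systematic sensor errors. A minimal sketch of that baseline update rule (weight cap and values illustrative):

```python
# Classical per-voxel TSDF fusion: running weighted average with a weight cap.
# This is the linear baseline that learned fusion methods improve on.

def tsdf_update(tsdf, weight, new_tsdf, new_weight=1.0, max_weight=64.0):
    """Fuse one new TSDF observation into a voxel's running average."""
    fused = (tsdf * weight + new_tsdf * new_weight) / (weight + new_weight)
    return fused, min(weight + new_weight, max_weight)

tsdf, w = 0.0, 0.0
for measurement in (0.4, 0.5, 0.6):   # noisy per-frame TSDF observations
    tsdf, w = tsdf_update(tsdf, w, measurement)
print(tsdf, w)  # converges to the mean of the observations
```

Because each update is a fixed linear blend, outliers and thin structures are handled poorly, which motivates predicting non-linear, data-dependent updates instead.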

Consistent video depth estimation

An algorithm is presented for reconstructing dense, geometrically consistent depth for all pixels in a monocular video by using a learning-based prior, i.e., a convolutional neural network trained for single-image depth estimation.

MVDepthNet: Real-Time Multiview Depth Estimation Neural Network

MVDepthNet is presented, a convolutional network that solves the depth estimation problem given several image-pose pairs from a localized monocular camera in neighboring viewpoints, and it is shown that this method can generate depth maps efficiently and precisely.

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

The 3D-R2N2 reconstruction framework outperforms the state-of-the-art methods for single view reconstruction, and enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).

Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance

This work introduces a neural network architecture that simultaneously learns the unknown geometry, camera parameters, and a neural renderer that approximates the light reflected from the surface towards the camera.

Atlas: End-to-End 3D Scene Reconstruction from Posed Images

An end-to-end 3D reconstruction method is presented that directly regresses a truncated signed distance function (TSDF) of a scene from a set of posed RGB images, and semantic segmentation of the 3D model is obtained without significant extra computation.

Occupancy Networks: Learning 3D Reconstruction in Function Space

This paper proposes Occupancy Networks, a new representation for learning-based 3D reconstruction methods that encodes a description of the 3D output at infinite resolution without excessive memory footprint, and validate that the representation can efficiently encode 3D structure and can be inferred from various kinds of input.
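The core idea of an occupancy function can be shown with a toy example: the shape is a function mapping 3D points to occupancy probability, so it can be queried on a grid of any resolution at inference time. Below, an analytic unit sphere stands in for a trained occupancy network; the grid extent and sharpness constant are illustrative.

```python
# Sketch of the occupancy-function representation: a continuous function
# over R^3 queried at arbitrary resolution. A sphere stands in for the network.
import numpy as np

def occupancy(points):
    # stand-in for a trained network: ~1 inside a radius-1 sphere, ~0 outside
    d = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp(10.0 * (d - 1.0)))  # smooth decision boundary

def query_grid(n):
    # sample the same function on an n^3 grid; n is chosen freely at inference
    axis = np.linspace(-1.5, 1.5, n)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return occupancy(grid.reshape(-1, 3)).reshape(n, n, n)

coarse = query_grid(16)
fine = query_grid(64)
# the occupied-volume fraction should agree across resolutions (same function)
frac_c = (coarse > 0.5).mean()
frac_f = (fine > 0.5).mean()
print(frac_c, frac_f)
```

Because the representation is a function rather than a fixed voxel grid, memory depends only on the network, and a mesh can later be extracted at whatever resolution is needed (e.g. via marching cubes on a queried grid).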

MVSNet: Depth Inference for Unstructured Multi-view Stereo

This work presents an end-to-end deep learning architecture for depth map inference from multi-view images that flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature.
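The variance-based cost metric mentioned above reduces to a simple aggregation: features from an arbitrary number of views are collapsed by their per-element variance, so any N views map to a single fixed-size cost feature, with low variance signalling photometric consistency. A toy NumPy sketch (feature sizes and values illustrative):

```python
# Hedged sketch of a variance-based multi-view cost: N view features
# aggregate into one fixed-size cost feature regardless of N.
import numpy as np

rng = np.random.default_rng(2)

def variance_cost(view_feats):
    """view_feats: (N, C) warped feature vectors for one voxel from N views."""
    mean = view_feats.mean(axis=0)
    return ((view_feats - mean) ** 2).mean(axis=0)  # (C,) cost feature

C = 4
consistent = np.tile(rng.normal(0, 1, (1, C)), (5, 1))  # 5 agreeing views
inconsistent = rng.normal(0, 1, (5, C))                 # 5 disagreeing views

# agreeing views give (near-)zero cost; disagreement gives high cost
print(variance_cost(consistent).sum(), variance_cost(inconsistent).sum())
```

Because the variance is computed element-wise over the view axis, the same network can take any number of input views without architectural changes.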

DeMoN: Depth and Motion Network for Learning Monocular Stereo

This work trains a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs, and in contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and better generalizes to structures not seen during training.

Learning a Multi-View Stereo Machine

End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from far fewer images than required by classical approaches, as well as completion of unseen surfaces.