Corpus ID: 235742858

TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

  title={TransformerFusion: Monocular RGB Scene Reconstruction using Transformers},
  author={Aljaž Božič and Pablo Rodríguez Palafox and Justus Thies and Angela Dai and Matthias Nießner},
We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene…
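The per-location fusion described above can be sketched in plain numpy: for one 3D location, features sampled from each frame are combined with attention weights, so the network can favor the most relevant views. This is a minimal single-head dot-product sketch; the function names, toy shapes, and learned-query assumption are illustrative, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_frame_features(frame_feats, query):
    """Attention-weighted fusion of per-frame features for one 3D location.

    frame_feats: (num_frames, dim) features sampled where the location
                 projects into each frame.
    query:       (dim,) query vector for this location (assumed learned).
    Returns a single (dim,) fused feature for the location.
    """
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])
    weights = softmax(scores)       # which frames matter most here
    return weights @ frame_feats    # weighted sum over frames

# Toy usage: 4 candidate frames, 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
q = rng.normal(size=8)
fused = fuse_frame_features(feats, q)
```

In the full method this fusion would be applied at every cell of the volumetric feature grid, which is then decoded into the implicit scene representation.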
VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction
This paper advocates that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of 3D reconstruction results.


NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video
To the best of our knowledge, this is the first learning-based system able to reconstruct dense, coherent 3D geometry in real time, outperforming state-of-the-art methods in both accuracy and speed.
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.
MVDepthNet: Real-Time Multiview Depth Estimation Neural Network
  • Kaixuan Wang, S. Shen
  • 2018 International Conference on 3D Vision (3DV), 2018
MVDepthNet is presented, a convolutional network to solve the depth estimation problem given several image-pose pairs from a localized monocular camera in neighbor viewpoints, and it is shown that this method can generate depth maps efficiently and precisely.
3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans
3D-SIS is introduced, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans that leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction.
Neural RGB→D Sensing: Depth and Uncertainty From a Video Camera
This paper proposes a deep learning method to estimate per-pixel depth and its uncertainty continuously from a monocular video stream, with the goal of effectively turning an RGB camera into an RGB-D camera.
DPSNet: End-to-end Deep Plane Sweep Stereo
DPSNet (Deep Plane Sweep Network), a convolutional neural network whose design is inspired by best practices of traditional geometry-based approaches to dense depth reconstruction, achieves state-of-the-art reconstruction results on a variety of challenging datasets.
SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans
A novel approach converts partial and noisy RGB-D scans into high-quality 3D scene reconstructions by inferring unobserved scene geometry; combined with a new sparse generative 3D convolutional neural network architecture, it predicts highly detailed surfaces in a coarse-to-fine hierarchical fashion.
Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps With Accurate Object Boundaries
Two improvements to existing approaches to single-image depth estimation are proposed. One concerns the strategy for fusing features extracted at different scales, realized by an improved network architecture consisting of four modules: an encoder, a decoder, a multi-scale feature fusion module, and a refinement module.
RfD-Net: Point Scene Understanding by Semantic Instance Reconstruction
RfD-Net is introduced, which jointly detects and reconstructs dense object surfaces directly from raw point clouds; it consistently outperforms the state of the art, improving object-reconstruction mesh IoU by over 11 points.