Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Kai-En Lin, Yen-Chen Lin, Wei-Sheng Lai, Tsung-Yi Lin, Yichang Shih, and Ravi Ramamoorthi. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
Although neural radiance fields (NeRF) have shown impressive advances in novel view synthesis, most methods require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches using local image features to reconstruct a 3D object often render blurry predictions at viewpoints distant from the source view. To address this, we propose to leverage both the global and local features to… 

NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion

NerfDiff distills the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF by synthesizing and refining a set of virtual views at test time, and significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.

SPARF: Large-Scale Learning of 3D Sparse Radiance Fields from Few Input Images

This work presents SPARF, a large-scale ShapeNet-based synthetic dataset for novel view synthesis consisting of 17 million images rendered from nearly 40,000 shapes at high resolution (400 × 400 pixels), and proposes a novel pipeline (SuRFNet) that learns to generate sparse voxel radiance fields from only a few views.

Neural Plenoptic Sampling: Learning Light-Field from Thousands of Imaginary Eyes

A simple Multi-Layer Perceptron (MLP) network is adopted as a universal function approximator to learn the plenoptic function at every position in the space of interest by placing virtual viewpoints at thousands of randomly sampled locations and leveraging multi-view geometric relationship.
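The summary above amounts to fitting the 5D plenoptic function (position plus viewing direction) with a coordinate MLP. A minimal sketch of such a network, with illustrative layer sizes and random placeholder weights rather than anything from the paper, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer MLP mapping a 5D plenoptic input (x, y, z, theta, phi)
# to an RGB radiance value; weights are random placeholders, not trained.
W1 = rng.standard_normal((5, 64)) * 0.1
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 3)) * 0.1
b2 = np.zeros(3)

def plenoptic_mlp(p):
    """Approximate radiance at a position-plus-direction sample p (shape (..., 5))."""
    h = np.maximum(p @ W1 + b1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid -> RGB in [0, 1]

rgb = plenoptic_mlp(np.array([0.1, -0.2, 0.5, 0.3, 1.2]))
```

In the paper's setting, such a network would be supervised by rays cast from thousands of virtual viewpoints; here it only illustrates the input/output shape of the approximator.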

SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

This work proposes SceneRF, a self-supervised monocular scene reconstruction method that trains on posed image sequences only, optimizing a radiance field with explicit depth optimization and a novel probabilistic sampling strategy to efficiently handle large scenes.

Novel View Synthesis with Diffusion Models

3DiM, a diffusion model for 3D novel view synthesis, is presented, which is able to translate a single input view into consistent and sharp completions across many views, and a new evaluation methodology, 3D consistency scoring, is introduced to measure the 3D consistency of a generated object by training a neural field on the model’s output views.

NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views

A novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models, which outperforms existing state-of-the-art baselines.

A Neural ODE Interpretation of Transformer Layers

A modification of the internal architecture of a transformer layer is proposed and it is shown that using neural ODE solvers with a sophisticated integration scheme further improves performance of transformer networks in multiple tasks.
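The Neural ODE reading rests on the observation that a residual update x ← x + f(x) is one explicit Euler step of dx/dt = f(x). A toy illustration of that correspondence (f here is an arbitrary smooth map standing in for a transformer sub-layer, not the paper's architecture):

```python
import numpy as np

def f(x):
    # Stand-in for a transformer sub-layer (attention / MLP); any smooth map works.
    return np.tanh(x) * 0.1

def residual_stack(x, depth):
    """A stack of residual layers: x <- x + f(x), repeated `depth` times."""
    for _ in range(depth):
        x = x + f(x)
    return x

def euler_ode(x, t_end, steps):
    """Explicit Euler integration of dx/dt = f(x) with step h = t_end / steps."""
    h = t_end / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

x0 = np.array([0.5, -1.0])
# With step size h = 1, `depth` residual layers coincide exactly with Euler steps.
assert np.allclose(residual_stack(x0, 4), euler_ode(x0, 4.0, 4))
```

The paper's contribution is to exploit this view with more sophisticated integration schemes than plain Euler; the snippet only demonstrates the baseline equivalence.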

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

This paper explores an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU, and is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases.

M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

A model-accelerator co-design framework to enable efficient on-device MTL that tackles both training and inference bottlenecks, achieving higher accuracy than encoder-focused MTL methods while reducing inference FLOPs by 88%.

Is Attention All NeRF Needs?

The analysis of the learned attention maps to infer depth and occlusion indicates that attention enables learning a physically-grounded rendering, showing the promise of transformers as a universal modeling tool for graphics.

pixelNeRF: Neural Radiance Fields from One or Few Images

We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time.

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

This work describes how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrates results that outperform prior work on neural rendering and view synthesis.
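The optimization described above renders each pixel by numerical quadrature along a ray: per-sample densities and colors are composited with weights T_i · (1 − exp(−σ_i δ_i)), where T_i is the accumulated transmittance. A minimal NumPy sketch of that compositing step (variable names are illustrative):

```python
import numpy as np

def volume_render(sigmas, rgbs, deltas):
    """NeRF-style quadrature along one ray.

    sigmas: (N,) densities, rgbs: (N, 3) colors, deltas: (N,) segment lengths,
    all ordered from near to far. Returns the composited pixel color.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * rgbs).sum(axis=0)

# An opaque red sample behind a nearly transparent blue one renders ~red.
color = volume_render(np.array([0.01, 50.0]),
                      np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]),
                      np.array([0.1, 0.1]))
```

In the full method these densities and colors come from an MLP queried at sampled 5D coordinates; the snippet shows only the differentiable compositing that makes the optimization possible.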

IBRNet: Learning Multi-View Image-Based Rendering

A method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views using a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations.

Deep view synthesis from sparse photometric images

This paper synthesizes novel viewpoints across a wide range of viewing directions (covering a 60° cone) from a sparse set of just six viewing directions, based on a deep convolutional network trained to directly synthesize new views from the six input views.

Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

The Scene Representation Transformer (SRT) is proposed, a method which processes posed or unposed RGB images of a new area, infers a “set-latent scene representation”, and synthesizes novel views, all in a single feed-forward pass, enabling global information integration and hence 3D reasoning.

DeepStereo: Learning to Predict New Views from the World's Imagery

This work presents a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets, and is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.

Fast and Explicit Neural View Synthesis

It is shown that with the simple formulation, the model is able to generalize novel view synthesis to object categories not seen during training and can use view synthesis as a self-supervision signal for efficient learning of 3D geometry without explicit 3D supervision.

Learning-based view synthesis for light field cameras

This paper proposes a novel learning-based approach to synthesize new views from a sparse set of input views that could potentially decrease the required angular resolution of consumer light field cameras, which allows their spatial resolution to increase.

Stereo Magnification: Learning View Synthesis using Multiplane Images

This paper explores an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including VR cameras and now-widespread dual-lens camera phones, and proposes a learning framework that leverages a new layered representation that is called multiplane images (MPIs).
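The multiplane-image representation mentioned above is a stack of fronto-parallel RGBA layers that are rendered by standard back-to-front alpha compositing. A minimal sketch of that compositing (array layout is an assumption for illustration, not the paper's exact pipeline):

```python
import numpy as np

def composite_mpi(planes):
    """Back-to-front alpha compositing of multiplane-image layers.

    planes: (D, H, W, 4) RGBA layers ordered from far to near.
    Returns an (H, W, 3) image via the standard `over` operator.
    """
    out = np.zeros(planes.shape[1:3] + (3,))
    for layer in planes:  # far to near
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = rgb * a + out * (1.0 - a)
    return out

# Two 1x1 layers: an opaque green far plane under a half-transparent red near
# plane blend to an equal mix of red and green.
img = composite_mpi(np.array([
    [[[0.0, 1.0, 0.0, 1.0]]],   # far: opaque green
    [[[1.0, 0.0, 0.0, 0.5]]],   # near: half-transparent red
]))
```

Novel views are obtained by reprojecting each plane with a homography before compositing; the snippet covers only the final blend.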
