SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Florian Langer, Gwangbin Bae, Ignas Budvytis, Roberto Cipolla
Estimating 3D shapes and poses of static objects from a single image has important applications for robotics, augmented reality and digital content creation. Often this is done through direct mesh predictions [15, 30, 39], which produce unrealistic, overly tessellated shapes, or by formulating shape prediction as a retrieval task followed by CAD model alignment [16, 17, 23, 24]. Directly predicting CAD model poses from 2D image features is difficult and inaccurate [23, 24]. Some works, such as…

Leveraging Geometry for Shape Estimation from a Single RGB Image

This work demonstrates how cross-domain keypoint matches from an RGB image to a rendered CAD model allow for more precise object pose predictions compared to ones obtained through direct predictions, and shows that by allowing object stretching the authors can modify retrieved CAD models to better fit the observed shapes.

From Points to Multi-Object 3D Reconstruction

This work proposes a keypoint detector that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes, in a single forward pass, enabling lightweight reconstruction of realistic and visually pleasing shapes based on CAD models.
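To make the "9-DoF" term concrete: such a pose is usually read as 3 translation, 3 rotation and 3 per-axis scale parameters. The sketch below is only an illustration under that assumption and is not the paper's implementation. The names `rotation_y` and `apply_9dof` are hypothetical, and the rotation is restricted to the up axis to keep the example short.

```python
import math

def rotation_y(angle):
    """3x3 rotation matrix about the y (up) axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def apply_9dof(vertices, scale, rotation, translation):
    """Map CAD-space vertices to scene space: x' = R (s * x) + t.

    scale and translation are 3-vectors; rotation is a 3x3 matrix.
    (Hypothetical helper, assuming scale is applied before rotation.)
    """
    out = []
    for v in vertices:
        scaled = [scale[i] * v[i] for i in range(3)]
        rotated = [sum(rotation[r][c] * scaled[c] for c in range(3))
                   for r in range(3)]
        out.append([rotated[i] + translation[i] for i in range(3)])
    return out
```

Applying per-axis scale before the rotation stretches the model along its own axes, which is why the order of the three operations matters.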

ROCA: Robust CAD Model Retrieval and Alignment from a Single Image

Experiments on challenging, real-world imagery from ScanNet show that ROCA significantly improves on state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.

SceneCAD: Predicting Object Alignments and Layouts in RGB-D Scans

A message-passing graph neural network is proposed to model the inter-relationships between objects and layout, guiding the generation of a global object alignment in a scene that takes the scene layout into account.

Vid2CAD: CAD Model Alignment Using Multi-View Constraints From Videos

The core idea of the method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation, which resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters.

Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

Mask2CAD is presented, which jointly detects objects in real-world images and, for each detected object, optimizes for the most similar CAD model and its pose. It constructs a joint embedding space between the detected image regions corresponding to objects and 3D CAD models, enabling retrieval of CAD models for an input RGB image.

Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

This work presents a novel differentiable renderer that learns to approximate the rasterization backward pass from data instead of relying on a hand-crafted algorithm, enabling gradient-based optimization directly on the 3D pose.

A Point Set Generation Network for 3D Object Reconstruction from a Single Image

This paper addresses the problem of 3D reconstruction from a single image by generating an unorthodox but straightforward form of output, a point set, and designs an architecture, loss function and learning paradigm that are novel and effective, capable of predicting multiple plausible 3D point clouds from an input image.
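A loss for point set outputs has to compare two unordered sets of 3D points. One widely used choice is the symmetric Chamfer distance. The summary above does not say which loss the paper uses, so the sketch below is only an illustration of the general idea, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a and b.

    For each point, take the squared distance to its nearest
    neighbour in the other set, average per set, then sum both
    directions. O(len(a) * len(b)); fine for small examples.
    """
    def one_way(src, dst):
        return sum(min(squared_distance(p, q) for q in dst)
                   for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)
```

Because each point is matched to its nearest neighbour, the result does not depend on the order in which the points are listed.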

End-to-End Recovery of Human Shape and Pose

This work introduces an adversary trained to tell whether human body shape and pose parameters are real or not using a large database of 3D human meshes, and produces a richer and more useful mesh representation that is parameterized by shape and 3D joint angles.

DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation

This work introduces DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high quality shape representation, interpolation and completion from partial and noisy 3D input data.
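For readers unfamiliar with signed distance functions, the sketch below shows an analytic SDF for a sphere. It assumes the common convention of negative values inside the surface and positive values outside. It illustrates the kind of function a learned SDF approximates and is not DeepSDF itself; `sphere_sdf` is a hypothetical name.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point to a sphere's surface.

    Negative inside, zero on the surface, positive outside
    (assumed sign convention).
    """
    return math.dist(point, center) - radius

# The surface is the zero level set of the function.
inside = sphere_sdf((0.0, 0.0, 0.0))      # centre of the unit sphere
on_surface = sphere_sdf((1.0, 0.0, 0.0))  # exactly on the surface
outside = sphere_sdf((2.0, 0.0, 0.0))     # one unit beyond the surface
```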