Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

@article{Chen2021Scan2CapCD,
  title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans},
  author={Dave Zhenyu Chen and Ali Gholami and Matthias Nie{\ss}ner and Angel Xuan Chang},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={3192-3202}
}
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring… 
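As a rough illustration of the detect-then-describe idea in the abstract, the sketch below shows a caption decoder that attends over surrounding object features while generating tokens. All module names, dimensions, and the random stand-ins for detector outputs are illustrative assumptions for this sketch, not the authors' implementation.

# Minimal, hypothetical detect-then-describe sketch; not the Scan2Cap architecture.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """GRU captioner that attends over context object features at every step."""
    def __init__(self, vocab_size, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.gru = nn.GRU(hidden_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_feat, context_feats, tokens):
        # target_feat: (B, feat_dim) feature of the object being described
        # context_feats: (B, K, feat_dim) features of surrounding proposals
        # tokens: (B, T) ground-truth token ids (teacher forcing)
        ctx = self.proj(context_feats)                       # (B, K, H)
        query = self.proj(target_feat).unsqueeze(1)          # (B, 1, H)
        attended, _ = self.attn(query, ctx, ctx)             # context-aware object feature
        emb = self.embed(tokens)                             # (B, T, H)
        attended = attended.expand(-1, emb.size(1), -1)      # broadcast over time steps
        hidden, _ = self.gru(torch.cat([emb, attended], -1))
        return self.out(hidden)                              # (B, T, vocab) next-token logits

# Usage with random stand-ins; a real system would take proposals from a 3D detector.
B, K, V = 2, 8, 1000
decoder = CaptionDecoder(vocab_size=V)
logits = decoder(torch.randn(B, 128), torch.randn(B, K, 128), torch.randint(0, V, (B, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])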
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
TLDR
This work proposes a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions, where it especially investigates the relative spatiality of objects in 3D scenes and designs a spatiality-guided encoder via a token-to-token spatial relation learning objective and an object-centric decoder for precise and spatiality-enhanced object caption generation.
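A minimal sketch of one way to inject pairwise spatial relations between proposals into a transformer encoder, in the spirit of the spatiality-guided encoder summarized above; the relation featurization, per-head bias, and layer sizes are assumptions for illustration, not the SpaCap3D implementation.

# Hypothetical spatial-relation bias for proposal-to-proposal attention.
import torch
import torch.nn as nn

def pairwise_spatial_features(centers):
    # centers: (B, K, 3) proposal box centers -> (B, K, K, 4) offsets plus distance
    diff = centers.unsqueeze(2) - centers.unsqueeze(1)
    dist = diff.norm(dim=-1, keepdim=True)
    return torch.cat([diff, dist], dim=-1)

class SpatialEncoder(nn.Module):
    def __init__(self, feat_dim=128, n_heads=4):
        super().__init__()
        self.rel_proj = nn.Linear(4, n_heads)  # one additive bias per attention head
        self.encoder = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.n_heads = n_heads

    def forward(self, obj_feats, centers):
        # obj_feats: (B, K, feat_dim); centers: (B, K, 3)
        B, K, _ = obj_feats.shape
        rel = self.rel_proj(pairwise_spatial_features(centers))          # (B, K, K, heads)
        attn_bias = rel.permute(0, 3, 1, 2).reshape(B * self.n_heads, K, K)
        return self.encoder(obj_feats, src_mask=attn_bias)

enc = SpatialEncoder()
out = enc(torch.randn(2, 8, 128), torch.randn(2, 8, 3))
print(out.shape)  # torch.Size([2, 8, 128])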
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
TLDR
2D Semantics Assisted Training (SAT) is proposed that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
ScanQA: 3D Question Answering for Spatial Scene Understanding
TLDR
A baseline model for 3D-QA is proposed, called ScanQA, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings that facilitates the regression of 3D bounding boxes to determine the described objects in textual questions.
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
TLDR
This paper proposes MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions in 3D dense captioning, and outperforms the current state-of-the-art method.
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
TLDR
The proposed X-Trans2Cap effectively boosts the performance of single-modal 3D captioning through knowledge distillation enabled by a teacher-student framework and outperforms previous state-of-the-art models by a large margin.
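A generic teacher-student distillation loss, shown only to illustrate the kind of cross-modal knowledge transfer summarized above; the temperature, weighting, and tensor shapes are assumptions, not the X-Trans2Cap training recipe.

# Generic distillation sketch: a frozen teacher's token distribution guides the student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend a soft KL term (match the teacher) with the usual hard cross-entropy
    against ground-truth caption tokens."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    return alpha * soft + (1 - alpha) * hard

# Example: the teacher would see paired 2D image features at training time only;
# at inference the student runs from the point cloud alone.
B, T_len, V = 2, 12, 1000
loss = distillation_loss(torch.randn(B, T_len, V), torch.randn(B, T_len, V),
                         torch.randint(0, V, (B, T_len)))
print(loss.item())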
3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language
TLDR
The 3DRefTransformer net is introduced, a transformer-based neural network that identifies 3D objects described by textual queries in real-world scenes and significantly improves upon the current SOTA on the Referit3D Nr3D and Sr3D datasets.
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
TLDR
This paper extends 3DVG to a more reliable and explainable task, called 3D Phrase Aware Grounding, and proposes a novel framework, i.e. PhraseRefer, which conducts phrase-aware and object-level representation learning through phrase-object alignment optimization as well as phrase-specific pre-training.
D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans
TLDR
D3Net is presented, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate; it introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions.
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
TLDR
This work presents D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate, and outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds
Observing that the 3D captioning task and the 3D grounding task contain both shared and complementary information in nature, in this work, we propose a unified framework to jointly solve these two tasks.
...

References

ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
TLDR
This work proposes ScanRefer, the first large-scale effort to perform object localization via natural language expression directly in 3D through learning a fused descriptor from 3D object proposals and encoded sentence embeddings.
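A toy sketch of the fuse-and-score idea in the summary above: each 3D proposal feature is concatenated with a sentence embedding and scored for how well it matches the query. The scoring head and dimensions are illustrative assumptions, not ScanRefer's exact architecture.

# Hypothetical fuse-and-score grounding head.
import torch
import torch.nn as nn

class FusionGrounder(nn.Module):
    def __init__(self, obj_dim=128, lang_dim=256, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(obj_dim + lang_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, proposal_feats, sentence_emb):
        # proposal_feats: (B, K, obj_dim); sentence_emb: (B, lang_dim)
        lang = sentence_emb.unsqueeze(1).expand(-1, proposal_feats.size(1), -1)
        fused = torch.cat([proposal_feats, lang], dim=-1)
        return self.score(fused).squeeze(-1)   # (B, K) matching score per proposal

grounder = FusionGrounder()
scores = grounder(torch.randn(2, 8, 128), torch.randn(2, 256))
print(scores.argmax(dim=-1))  # index of the best-matching proposal per sample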
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
TLDR
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
RevealNet: Seeing Behind Objects in RGB-D Scans
TLDR
RevealNet is a new data-driven approach that jointly detects object instances and predicts their complete geometry, which enables a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, including hidden and unobserved object parts.
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
TLDR
3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.
SUN RGB-D: A RGB-D scene understanding benchmark suite
TLDR
This paper introduces an RGB-D benchmark suite for the goal of advancing the state-of-the-art in all major scene understanding tasks, and presents a dataset that enables training data-hungry algorithms for scene-understanding tasks, evaluating them using meaningful 3D metrics, avoiding overfitting to a small testing set, and studying cross-sensor bias.
3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans
TLDR
3D-SIS is introduced, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans that leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction.
Matterport3D: Learning from RGB-D Data in Indoor Environments
TLDR
Matterport3D is introduced, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes that enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes
TLDR
This work builds on top of VoteNet and proposes a 3D detection architecture called ImVoteNet specialized for RGB-D scenes, based on fusing 2D votes in images and 3D votes in point clouds, advancing state-of-the-art results by 5.7 mAP.
3D-BEVIS: Birds-Eye-View Instance Segmentation
TLDR
3D-BEVIS (3D bird’s-eye-view instance segmentation), a deep learning framework for joint semantic- and instance-segmentation on 3D point clouds is presented, which learns a feature embedding and groups the obtained feature space into semantic instances.
SceneNN: A Scene Meshes Dataset with aNNotations
TLDR
This paper introduces SceneNN, an RGB-D scene dataset consisting of 100 scenes that is used as a benchmark to evaluate the state-of-the-art methods on relevant research problems such as intrinsic decomposition and shape completion.
...