Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
@article{Chen2021Scan2CapCD, title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans}, author={Dave Zhenyu Chen and Ali Gholami and Matthias Nie{\ss}ner and Angel Xuan Chang}, journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2021}, pages={3192-3202} }
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is bounding boxes along with descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method that detects objects in the input scene and describes them in natural language. We use an attention mechanism that generates descriptive tokens while referring…
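The abstract outlines a detect-then-describe pipeline: a 3D detector proposes objects, and an attention-based decoder generates caption tokens while attending to the detected objects. Below is a minimal PyTorch sketch of that pattern; the class name, dimensions, GRU decoder, and multi-head attention are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a detect-then-describe captioner in the spirit of Scan2Cap.
# All module names, dimensions, and the GRU decoder are assumptions.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Generates caption tokens while attending over detected object features."""
    def __init__(self, vocab_size, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(feat_dim, hidden_dim)   # map object features into decoder space
        self.gru = nn.GRU(hidden_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, object_feats):
        # tokens: (B, T) token ids; object_feats: (B, K, feat_dim) proposal features
        ctx = self.proj(object_feats)                 # (B, K, hidden_dim)
        emb = self.embed(tokens)                      # (B, T, hidden_dim)
        # attend from each partial caption state to the detected objects
        attended, _ = self.attn(emb, ctx, ctx)        # (B, T, hidden_dim)
        h, _ = self.gru(torch.cat([emb, attended], dim=-1))
        return self.out(h)                            # (B, T, vocab_size) logits

# Usage: object_feats would come from a 3D detector (e.g. VoteNet-style proposals).
decoder = CaptionDecoder(vocab_size=3000)
logits = decoder(torch.randint(0, 3000, (2, 12)), torch.randn(2, 32, 128))
```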
19 Citations
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
- Computer Science, IJCAI
- 2022
This work proposes a transformer-based encoder-decoder architecture, SpaCap3D, to transform objects into descriptions. It investigates the relative spatiality of objects in 3D scenes and designs a spatiality-guided encoder via a token-to-token spatial relation learning objective, together with an object-centric decoder for precise, spatiality-enhanced object caption generation.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
2D Semantics Assisted Training (SAT) is proposed, which utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
ScanQA: 3D Question Answering for Spatial Scene Understanding
- Computer Science, ArXiv
- 2021
A baseline model for 3D-QA, called ScanQA, is proposed; it learns a fused descriptor from 3D object proposals and encoded sentence embeddings that facilitates the regression of 3D bounding boxes to determine the objects described in textual questions.
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
- Computer Science, ArXiv
- 2022
This paper proposes MORE, a Multi-Order RElation mining model, to generate more descriptive and comprehensive captions in 3D dense captioning; it outperforms the current state-of-the-art method.
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
- Computer Science, Environmental Science, ArXiv
- 2022
The proposed X-Trans2Cap effectively boosts the performance of single-modal 3D captioning through knowledge distillation enabled by a teacher-student framework, and outperforms previous state-of-the-art models by a large margin.
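The snippet above describes teacher-student knowledge distillation, where a multi-modal teacher guides a point-cloud-only student. A minimal sketch of a standard distillation loss of this kind follows; the tempered-softmax KL soft targets and the alpha weighting are conventional KD assumptions, not the paper's exact objective.

```python
# Generic teacher-student distillation loss of the kind X-Trans2Cap builds on.
# Temperature and loss weighting are illustrative defaults, not the paper's values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # soft targets: match the teacher's tempered token distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # hard targets: standard cross-entropy against ground-truth caption tokens
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           targets.view(-1))
    return alpha * soft + (1 - alpha) * hard
```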
3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language
- Computer Science, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
- 2022
The 3DRefTransformer network is introduced, a transformer-based model that identifies the 3D object in a real-world scene described by a linguistic utterance, significantly improving performance over the current SOTA on the Referit3D Nr3D and Sr3D datasets.
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
- Computer Science, ArXiv
- 2022
This paper extends 3DVG to a more reliable and explainable task, called 3D Phrase Aware Grounding, and proposes a novel framework, i.e. PhraseRefer, which conducts phrase-aware and object-level representation learning through phrase-object alignment optimization as well as phrase-specific pre-training.
D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans
- Computer Science, ArXiv
- 2021
D3Net is presented, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate; it introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions.
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
- Computer Science
- 2021
This work presents D3Net, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate, and outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds
- Environmental Science
Observing that the 3D captioning task and the 3D grounding task contain both shared and complementary information in nature, in this work, we propose a unified framework to jointly solve these two…
References
Showing 1-10 of 60 references
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
- Computer Science, ECCV
- 2020
This work proposes ScanRefer, the first large-scale effort to perform object localization via natural language expression directly in 3D through learning a fused descriptor from 3D object proposals and encoded sentence embeddings.
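The fused-descriptor idea appears in both ScanRefer and ScanQA above: concatenate each 3D proposal feature with a sentence embedding and score how well the proposal matches the description. A hedged sketch follows, with illustrative dimensions and an assumed MLP head rather than the papers' actual architectures.

```python
# Sketch of proposal-language fusion for grounding; shapes are assumptions.
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    def __init__(self, obj_dim=128, lang_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obj_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),                  # one matching score per proposal
        )

    def forward(self, object_feats, sentence_emb):
        # object_feats: (B, K, obj_dim); sentence_emb: (B, lang_dim)
        lang = sentence_emb.unsqueeze(1).expand(-1, object_feats.size(1), -1)
        fused = torch.cat([object_feats, lang], dim=-1)
        return self.mlp(fused).squeeze(-1)      # (B, K) localization scores

scores = FusionScorer()(torch.randn(2, 32, 128), torch.randn(2, 256))
```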
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
RevealNet: Seeing Behind Objects in RGB-D Scans
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
RevealNet is a new data-driven approach that jointly detects object instances and predicts their complete geometry, which enables a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, including hidden and unobserved object parts.
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
- Computer Science, ECCV
- 2018
3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.
SUN RGB-D: A RGB-D scene understanding benchmark suite
- Computer Science, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper introduces an RGB-D benchmark suite for the goal of advancing the state of the art in all major scene understanding tasks, and presents a dataset that enables training data-hungry algorithms for scene-understanding tasks, evaluating them using meaningful 3D metrics, avoiding overfitting to a small testing set, and studying cross-sensor bias.
3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
3D-SIS is introduced, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans that leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction.
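Associating 2D images with a volumetric grid via pose alignment, as described above, amounts to back-projection: project voxel centers into the image with known intrinsics and extrinsics, then gather per-pixel features. A minimal sketch under assumed names and shapes (not the 3D-SIS code):

```python
# Illustrative 2D-to-3D feature back-projection; all names/shapes are assumptions.
import torch

def backproject_features(feat2d, K, world2cam, voxel_centers):
    # feat2d: (C, H, W) image features; K: (3, 3) intrinsics
    # world2cam: (4, 4) extrinsics; voxel_centers: (N, 3) world coordinates
    C, H, W = feat2d.shape
    ones = torch.ones(voxel_centers.size(0), 1)
    cam = (world2cam @ torch.cat([voxel_centers, ones], dim=1).T)[:3]  # (3, N)
    uvw = K @ cam                                # perspective projection
    u = (uvw[0] / uvw[2]).round().long()
    v = (uvw[1] / uvw[2]).round().long()
    valid = (uvw[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = torch.zeros(voxel_centers.size(0), C)
    out[valid] = feat2d[:, v[valid], u[valid]].T  # gather features per visible voxel
    return out                                    # (N, C) lifted features
```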
Matterport3D: Learning from RGB-D Data in Indoor Environments
- Computer Science, 2017 International Conference on 3D Vision (3DV)
- 2017
Matterport3D is introduced, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes that enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes
- Computer Science, Environmental Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work builds on top of VoteNet and proposes a 3D detection architecture called ImVoteNet specialized for RGB-D scenes, based on fusing 2D votes in images and 3D votes in point clouds, advancing state-of-the-art results by 5.7 mAP.
3D-BEVIS: Birds-Eye-View Instance Segmentation
- Computer Science, GCPR
- 2019
3D-BEVIS (3D bird’s-eye-view instance segmentation) is presented, a deep learning framework for joint semantic and instance segmentation on 3D point clouds, which learns a feature embedding and groups the obtained feature space into semantic instances.
SceneNN: A Scene Meshes Dataset with aNNotations
- Computer Science, 2016 Fourth International Conference on 3D Vision (3DV)
- 2016
This paper introduces SceneNN, an RGB-D scene dataset consisting of 100 scenes that is used as a benchmark to evaluate the state-of-the-art methods on relevant research problems such as intrinsic decomposition and shape completion.