Corpus ID: 245334889

ScanQA: 3D Question Answering for Spatial Scene Understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Motoaki Kawanabe
We propose a new 3D spatial understanding task, 3D question answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of a rich RGB-D indoor scan and answer textual questions about that scene. Unlike 2D visual question answering, conventional 2D-QA models struggle with the spatial understanding of object alignment and direction, and fail to localize the objects referred to in textual questions within the 3D scene. We propose… 
Decomposing NeRF for Editing via Feature Field Distillation
This work tackles the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes, and distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors into a 3D feature field optimized in parallel to the radiance field.
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
This work proposes Scan2Cap, an end-to-end trained method to detect objects in the input scene and describe them in natural language, which can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
This work proposes ScanRefer, the first large-scale effort to perform object localization via natural language expression directly in 3D through learning a fused descriptor from 3D object proposals and encoded sentence embeddings.
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception
It is found that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.
Visual Question Answering on 360° Images
The first VQA 360° dataset is collected, containing around 17,000 real-world image-question-answer triplets for a variety of question types, and it is demonstrated that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models.
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
Video Question Answering with Spatio-Temporal Reasoning
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale video VQA dataset named TGIF-QA that extends existing VQA work with these new tasks.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Deep Hough Voting for 3D Object Detection in Point Clouds
This work proposes VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting that achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency.
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
The non-trivial 3D visual grounding task is effectively reformulated as a simplified instance-matching problem, on the grounds that instance-level candidates are more reliable than redundant 3D object proposals.