ScanQA: 3D Question Answering for Spatial Scene Understanding
@article{Azuma2021ScanQA3Q,
  title={ScanQA: 3D Question Answering for Spatial Scene Understanding},
  author={Daichi Azuma and Taiki Miyanishi and Shuhei Kurita and Motoki Kawanabe},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.10482}
}
We propose a new 3D spatial understanding task for 3D question answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of a rich RGB-D indoor scan and answer given textual questions about the 3D scene. Unlike 2D visual question answering, conventional 2D-QA models suffer from problems with spatial understanding of object alignment and directions, and fail at localizing objects from textual questions in 3D-QA. We propose…
One Citation
Decomposing NeRF for Editing via Feature Field Distillation
- Computer Science · ArXiv
- 2022
This work tackles the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes, and distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors into a 3D feature field optimized in parallel to the radiance field.
References
Showing 1–10 of 52 references
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes Scan2Cap, an end-to-end trained method to detect objects in the input scene and describe them in natural language, which can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin.
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
- Computer Science · ECCV
- 2020
This work proposes ScanRefer, the first large-scale effort to perform object localization via natural language expression directly in 3D through learning a fused descriptor from 3D object proposals and encoded sentence embeddings.
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
- Computer Science · ECCV
- 2018
3DMV is presented, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network that achieves significantly better results than existing baselines.
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
It is found that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.
Visual Question Answering on 360° Images
- Computer Science
- 2020
The first VQA 360° dataset is collected, containing around 17,000 real-world image-question-answer triplets for a variety of question types, and it is demonstrated that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models.
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
Video Question Answering with Spatio-Temporal Reasoning
- Computer Science · International Journal of Computer Vision
- 2019
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with these new tasks.
VQA: Visual Question Answering
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
Deep Hough Voting for 3D Object Detection in Point Clouds
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work proposes VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting, which achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size, and high efficiency.
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
The non-trivial 3D visual grounding task is effectively reformulated as a simplified instance-matching problem, on the grounds that instance-level candidates are more rational than redundant 3D object proposals.