• Corpus ID: 232417286

Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud

  • Mingtao Feng, Zhen Li, Qi Li, Liang Zhang, XiangDong Zhang, Guangming Zhu, Hui Zhang, Yaonan Wang, Ajmal S. Mian
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description. Understanding complex and diverse descriptions, and lifting them directly to a point cloud is a new and challenging topic due to the irregular and sparse nature of point clouds. There are three main challenges in 3D object grounding: to find the main focus in the complex and diverse description; to understand the point cloud scene; and to locate the target… 


SAT: 2D Semantics Assisted Training for 3D Visual Grounding
2D Semantics Assisted Training (SAT) is proposed, which utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
LanguageRefer: Spatial-Language Model for 3D Visual Grounding
This paper introduces a spatial-language model that combines spatial embeddings from bounding boxes with fine-tuned language embeddings from DistilBERT to predict the target object, and shows that it performs competitively on the visio-linguistic datasets proposed by ReferIt3D.
3D Question Answering
  • Shuquan Ye, Dongdong Chen, Songfang Han, Jing Liao
  • Computer Science
  • 2021
The first attempt at extending VQA to the 3D domain is presented, which can facilitate artificial intelligence’s perception of 3D real-world scenarios. It includes the first 3DQA dataset, “ScanQA”, which builds on the ScanNet dataset and contains ∼6K questions and ∼30K answers for 806 scenes.
Looking Outside the Box to Ground Language in 3D Scenes
A model for grounding language in 3D scenes that bypasses box proposal bottlenecks with three main innovations, which result in significant quantitative gains over previous approaches on popular 3D language grounding benchmarks.


A Hierarchical Graph Network for 3D Object Detection on Point Clouds
A new graph convolution (GConv) based hierarchical graph network (HGNet) for 3D object detection, which processes raw point clouds directly to predict 3D bounding boxes and outperforms state-of-the-art methods on two large-scale point cloud datasets.
Relation Graph Network for 3D Object Detection in Point Clouds
A strategy that associates the predictions of direction vectors with pseudo geometric centers is proposed, leading to a win-win solution for 3D bounding box candidate regression, and the effect of relation graphs on the enhancement of proposals’ appearance features is explored under supervised and unsupervised settings.
Deep Hough Voting for 3D Object Detection in Point Clouds
This work proposes VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting that achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency.
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
  • Yin Zhou, Oncel Tuzel
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
VoxelNet is proposed, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network and learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists.
Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud
  • Weijing Shi, R. Rajkumar
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
The proposed graph neural network, named Point-GNN, is designed to predict the category and shape of the object that each vertex in the graph belongs to; a box merging and scoring operation is also designed to combine detections from multiple vertices accurately.
Multi-view 3D Object Detection Network for Autonomous Driving
This paper proposes Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. It also designs a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths.
Learning Cross-modal Context Graph for Visual Grounding
A language-guided graph representation is proposed to capture the global context of grounding entities and their relations, and a cross-modal graph matching strategy for the multiple-phrase visual grounding task is developed.
Real-Time Referring Expression Comprehension by Single-Stage Grounding Network
The proposed Single-Stage Grounding network is time efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25 ms (40 referents per second) on average with an Nvidia Tesla P40, achieving more than a 9× speedup over the existing multi-stage models.
PIXOR: Real-time 3D Object Detection from Point Clouds
PIXOR is proposed, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. It notably surpasses other state-of-the-art methods in terms of Average Precision (AP), while still running at 10 FPS.
A Fast and Accurate One-Stage Approach to Visual Grounding
A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.