• Corpus ID: 245006203

SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations

Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, Junjun Jiang, Bolei Zhou, Hang Zhao
Pre-training has become a standard paradigm in many computer vision tasks. However, most of the methods are generally designed on the RGB image domain. Due to the discrepancy between the two-dimensional image plane and the three-dimensional space, such pre-trained models fail to perceive spatial information and serve as sub-optimal solutions for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe the three-dimensional space and is more… 
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation
The proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins and achieves the most competitive result on the highly competitive KITTI depth estimation benchmark.
3D Object Detection for Autonomous Driving: A Review and New Outlooks
This paper conducts a comprehensive survey of the progress in 3D object detection from the aspects of models and sensory inputs, including LiDAR-based, camera-based, and multi-modal detection approaches, and provides an in-depth analysis of the potentials and challenges in each category of methods.


PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
This work aims at facilitating research on 3D representation learning by selecting a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes and achieving improvement over recent best results in segmentation and detection across 6 different benchmarks.
CoCoNets: Continuous Contrastive 3D Scene Representations
This model outperforms many existing state-of-the-art methods for 3D feature learning and view prediction, which are either limited by 3D grid spatial resolution, do not attempt to build amodal 3D representations, or do not handle combinatorial scene variability due to their non-convolutional bottlenecks.
Pri3D: Can 3D Priors Help 2D Representation Learning?
This work proposes to employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into learned 2D representations, which results in improvement over 2D-only representation learning on the image-based tasks of semantic segmentation, instance segmentation and object detection on real-world indoor datasets.
Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts
This study reveals that exhaustive labelling of 3D point clouds might be unnecessary; and remarkably, on ScanNet, even using 0.1% of point labels, the method achieves state-of-the-art results on a suite of benchmarks where training data or labels are scarce.
MVX-Net: Multimodal VoxelNet for 3D Object Detection
PointFusion and VoxelFusion are presented: two simple yet effective early-fusion approaches to combine the RGB and point cloud modalities, by leveraging the recently introduced VoxelNet architecture.
Multi-View Adaptive Fusion Network for 3D Object Detection
This work proposes an attentive pointwise fusion (APF) module that uses attention mechanisms to estimate the importance of the three sources, achieving adaptive fusion of multi-view features in a pointwise manner, and designs an end-to-end learnable network named MVAF-Net to integrate these two components.
Multi-view 3D Object Detection Network for Autonomous Driving
This paper proposes Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. It also designs a deep fusion scheme that combines region-wise features from multiple views and enables interactions between intermediate layers of different paths.
Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining
This work proposes pixel-to-point knowledge transfer, which uses 2D information by mapping pixel-level and point-level features into the same embedding space, and introduces a back-projection function that aligns features between 2D and 3D to make the transfer possible.
FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection
The solution achieves 1st place among all vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020. The proposed general framework, FCOS3D, does not rely on any 2D detection or 2D-3D correspondence priors.
Multimodal Contrastive Training for Visual Representation Learning
This work develops an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives, and exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously, hence improving the quality of learned visual representations.