TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, Chiew-Lan Tai
LiDAR and camera are two important sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, the robustness against inferior image conditions, e.g., bad illumination and sensor misalignment, is under-explored. Existing fusion methods are easily affected by such conditions, mainly due to a hard association of LiDAR points and image pixels, established by calibration matrices. We propose TransFusion, a robust solution to LiDAR-camera…
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection
This work collects a series of real-world cases with noisy data distributions and systematically formulates a robustness benchmark toolkit that simulates these cases on any clean autonomous driving dataset, and holistically benchmarks state-of-the-art fusion methods for the first time.
BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
This work proposes a surprisingly simple yet novel fusion framework, dubbed BEVFusion, whose camera stream does not depend on LiDAR input, thus addressing the downside of previous methods. It is the first to handle realistic LiDAR malfunctions and can be deployed in realistic scenarios without any post-processing procedure.
3D Object Detection for Autonomous Driving: A Review and New Outlooks
This paper conducts a comprehensive survey of the progress in 3D object detection from the aspects of models and sensory inputs, including LiDAR-based, camera-based, and multi-modal detection approaches, and provides an in-depth analysis of the potentials and challenges in each category of methods.
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost.
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
This work proposes TransFuser, a mechanism to integrate image and LiDAR representations using self-attention, which outperforms all prior work on the CARLA leaderboard in terms of driving score and reduces the average collisions per kilometer.
Transformers for Multi-Object Tracking on Point Clouds
We present TransMOT, a novel transformer-based end-to-end trainable online tracker and detector for point cloud data. The model utilizes a cross- and a self-attention mechanism and is applicable to
Scaling up Kernels in 3D CNNs
This work presents the spatial-wise group convolution and its large-kernel module (SW-LK block), and shows for the first time that large kernels are feasible and essential for 3D networks.
Transformers in 3D Point Clouds: A Survey
This survey aims to provide a comprehensive overview of 3D Transformers designed for various tasks, and compares the performance of Transformer-based algorithms in terms of point cloud classification, segmentation, and object detection.
A Survey of Visual Transformers
This survey has comprehensively reviewed over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, and proposed the deformable attention module, which combines the sparse spatial sampling of deformable convolution with the relation modeling capability of Transformers.
LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation
Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial


LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation
A coarse-to-fine LiDAR and camera fusion-based network (termed LIF-Seg) is proposed for LiDAR segmentation, outperforming existing methods by a large margin, together with an offset rectification approach to align the features of the two modalities.
3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection
In this paper, we propose a new deep architecture for fusing camera and LiDAR sensors for 3D object detection. Because the camera and LiDAR sensor signals have different characteristics and
Multimodal Virtual Point 3D Detection
This work presents an approach to seamlessly fuse RGB sensors into Lidar-based 3D recognition, and shows that this framework improves a strong CenterPoint baseline by a significant 6.6 mAP, and outperforms competing fusion approaches.
PointAugmenting: Cross-Modal Augmentation for 3D Object Detection
PointAugmenting decorates point clouds with corresponding point-wise CNN features extracted by pretrained 2D detection models and then performs 3D object detection over the decorated point clouds, achieving new state-of-the-art results on the nuScenes leaderboard.
PointPainting: Sequential Fusion for 3D Object Detection
PointPainting, a sequential fusion method, works by projecting lidar points into the output of an image-only semantic segmentation network and appending the class scores to each point; the work also shows how latency can be minimized through pipelining.
EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection
To address two critical issues in the 3D detection task, namely the exploitation of multiple sensors and the inconsistency between localization and classification confidence, a novel fusion module is proposed that enhances point features with semantic image features in a point-wise manner without any image annotations.
LiDAR R-CNN: An Efficient and Universal 3D Object Detector
Comprehensive experimental results on real-world datasets such as the Waymo Open Dataset (WOD) and KITTI, with various popular detectors, demonstrate the universality and superiority of LiDAR R-CNN.
Deep Continuous Fusion for Multi-sensor 3D Object Detection
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable
PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module
This work proposes a novel fusion approach, the Point-based Attentive Cont-conv Fusion (PACF) module, which fuses multi-sensor features directly on 3D points, and a 3D multi-sensor multi-task network called Pointcloud-Image RCNN (PI-RCNN for short), which handles the image segmentation and 3D object detection tasks.
PointPillars: Fast Encoders for Object Detection From Point Clouds
Benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds; the work also proposes a lean downstream network.