TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

@article{Bai2022TransFusionRL,
  title={TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers},
  author={Xuyang Bai and Zeyu Hu and Xinge Zhu and Qingqiu Huang and Yilun Chen and Hongbo Fu and Chiew-Lan Tai},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.11496}
}
LiDAR and camera are two important sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, the robustness against inferior image conditions, e.g., bad illumination and sensor misalignment, is under-explored. Existing fusion methods are easily affected by such conditions, mainly due to a hard association of LiDAR points and image pixels, established by calibration matrices. We propose TransFusion, a robust solution to LiDAR-camera… 
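The abstract's critique hinges on the point-to-pixel mapping itself. Below is a minimal NumPy sketch (illustrative names, not TransFusion's code) of the hard association: each LiDAR point is mapped to exactly one pixel through the calibration matrices, so any calibration error or sensor misalignment silently corrupts every point-pixel pair.

    import numpy as np

    def project_lidar_to_image(points_xyz, T_cam_from_lidar, K):
        # points_xyz: (N, 3) LiDAR points; T_cam_from_lidar: (4, 4) extrinsic;
        # K: (3, 3) camera intrinsic. Returns (N, 2) pixel coords and a validity mask.
        n = points_xyz.shape[0]
        pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous coordinates
        pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]    # transform into camera frame
        in_front = pts_cam[:, 2] > 0                       # keep points ahead of the camera
        uvw = (K @ pts_cam.T).T
        uv = uvw[:, :2] / uvw[:, 2:3]                      # perspective divide
        return uv, in_front

A small error in T_cam_from_lidar shifts every returned pixel coordinate, which is why methods built on this one-to-one lookup degrade under sensor misalignment.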
BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
TLDR
This work proposes a surprisingly simple yet novel fusion framework, dubbed BEVFusion, whose camera stream does not depend on LiDAR input; it thereby addresses the downside of previous methods, is the first to handle realistic LiDAR malfunction, and can be deployed in realistic scenarios without any post-processing procedure.
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection
TLDR
This work collects a series of real-world cases with noisy data distribution, and systematically formulate a robustness benchmark toolkit, that simulates these cases on any clean autonomous driving datasets, and holistically benchmark the state-of-the-art fusion methods for the first time.
3D Object Detection for Autonomous Driving: A Review and New Outlooks
TLDR
This paper conducts a comprehensive survey of the progress in 3D object detection from the aspects of models and sensory inputs, including LiDAR-based, camera-based, and multi-modal detection approaches, and provides an in-depth analysis of the potentials and challenges in each category of methods.
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
TLDR
BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes, establishing the new state of the art on nuScenes: 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, at 1.9× lower computation cost.
AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection
TLDR
The Cross-Domain DeformCAFA module is proposed, along with a simple yet effective cross-modal augmentation strategy based on a convex combination of image patches given their depth information; together they enhance tolerance to calibration error and greatly speed up feature aggregation across different modalities.
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
TLDR
This work proposes TransFuser, a mechanism to integrate image and LiDAR representations using self-attention, which outperforms all prior work on the CARLA leaderboard in terms of driving score and reduces the average collisions per kilometer.
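For intuition, here is a hedged PyTorch sketch (module and tensor names are assumptions, not TransFuser's API) of the underlying fusion idea: flatten image and LiDAR feature maps into tokens, concatenate them, and let standard self-attention mix the two modalities.

    import torch
    import torch.nn as nn

    class TokenFusion(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_feats, lidar_feats):
            # img_feats: (B, C, Hi, Wi); lidar_feats: (B, C, Hl, Wl)
            img_tok = img_feats.flatten(2).transpose(1, 2)    # (B, Hi*Wi, C)
            lid_tok = lidar_feats.flatten(2).transpose(1, 2)  # (B, Hl*Wl, C)
            tokens = torch.cat([img_tok, lid_tok], dim=1)     # one joint token set
            fused, _ = self.attn(tokens, tokens, tokens)      # cross-modal self-attention
            return fused

Because every token attends to every other, each image location can borrow geometry from LiDAR and vice versa, without any explicit point-to-pixel lookup.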
LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation
TLDR
‘LaRa’ is presented, an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras, which outperforms the best previous transformer-based works on nuScenes.
Transformers for Multi-Object Tracking on Point Clouds
TLDR
A novel transformer-based, end-to-end trainable online tracker and detector for point cloud data that utilizes cross-attention and self-attention mechanisms and is applicable to LiDAR data in an automotive context, as well as to other data types, such as radar.
Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds
TLDR
This work proposes Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations, which improves 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset; it also shows that, by pre-training with Voxel-MAE, the method requires only 40% of the annotated data to outperform a randomly initialized equivalent.
Scaling up Kernels in 3D CNNs
TLDR
This work presents the spatial-wise group convolution and its large-kernel module (SW-LK block) and shows that large kernels are feasible and essential for 3D networks for the first time.
…

References

SHOWING 1-10 OF 75 REFERENCES
LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation
TLDR
A coarse-to-fine LiDAR and camera fusion-based network (termed LIF-Seg) for LiDAR semantic segmentation, together with an offset rectification approach to align the two modalities' features; it outperforms existing methods by a large margin.
Multimodal Virtual Point 3D Detection
TLDR
This work presents an approach to seamlessly fuse RGB sensors into LiDAR-based 3D recognition, and shows that this framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches.
PointAugmenting: Cross-Modal Augmentation for 3D Object Detection
TLDR
PointAugmenting decorates point clouds with corresponding point-wise CNN features extracted by pretrained 2D detection models, then performs 3D object detection over the decorated point clouds, achieving new state-of-the-art results on the nuScenes leaderboard.
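The decoration step can be sketched in a few lines (illustrative names, not the paper's implementation): after projecting points into the image as sketched earlier, gather the CNN feature at each projected pixel and concatenate it to the point's own features.

    import numpy as np

    def decorate_points(point_feats, uv, cnn_feat_map):
        # point_feats: (N, Cp) per-point features; uv: (N, 2) projected pixel coords;
        # cnn_feat_map: (H, W, Ci) feature map from a pretrained 2D detector.
        h, w, _ = cnn_feat_map.shape
        u = np.clip(uv[:, 0].astype(np.int64), 0, w - 1)
        v = np.clip(uv[:, 1].astype(np.int64), 0, h - 1)
        img_feats = cnn_feat_map[v, u]                     # (N, Ci) gathered features
        return np.concatenate([point_feats, img_feats], axis=1)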
EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection
TLDR
A novel fusion module is proposed to enhance point features with semantic image features in a point-wise manner, without any image annotations, addressing two critical issues in the 3D detection task: the exploitation of multiple sensors and the inconsistency between localization and classification confidence.
LiDAR R-CNN: An Efficient and Universal 3D Object Detector
TLDR
Comprehensive experimental results on real-world datasets like Waymo Open Dataset (WOD) and KITTI dataset with various popular detectors demonstrate the universality and superiority of the LiDAR R-CNN.
Deep Continuous Fusion for Multi-sensor 3D Object Detection
In this paper, we propose a novel 3D object detector that can exploit both LiDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LiDAR feature maps at different levels of resolution.
PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module
TLDR
A novel fusion approach named the Point-based Attentive Cont-conv Fusion (PACF) module, which fuses multi-sensor features directly on 3D points, and a 3D multi-sensor multi-task network called Pointcloud-Image RCNN (PI-RCNN for short), which handles the image segmentation and 3D object detection tasks.
PointPillars: Fast Encoders for Object Detection From Point Clouds
TLDR
Benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds, and a lean downstream network is proposed.
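A rough sketch of the pillar encoding (illustrative code; the grid extents follow the paper's KITTI configuration): points are bucketed into vertical columns on the x-y grid so that a 2D CNN can process the resulting pseudo-image.

    import numpy as np

    def pillarize(points_xyz, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                  pillar_size=0.16):
        # Assign each in-range point an (ix, iy) pillar index on the x-y grid.
        x, y = points_xyz[:, 0], points_xyz[:, 1]
        keep = ((x >= x_range[0]) & (x < x_range[1]) &
                (y >= y_range[0]) & (y < y_range[1]))
        ix = ((x[keep] - x_range[0]) / pillar_size).astype(np.int64)
        iy = ((y[keep] - y_range[0]) / pillar_size).astype(np.int64)
        return np.stack([ix, iy], axis=1), keep

Points sharing a pillar index are then pooled into a single feature vector, yielding a dense 2D grid for the downstream network.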
Frustum PointNets for 3D Object Detection from RGB-D Data
TLDR
This work directly operates on raw point clouds by popping up RGB-D scans and leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall even for small objects.
Sensor Fusion for Joint 3D Object Detection and Semantic Segmentation
TLDR
An extension to LaserNet, an efficient and state-of-the-art LiDAR-based 3D object detector, is presented, and a method for fusing image data with the LiDAR data is proposed and shown to improve the detection performance of the model, especially at long ranges.
…