AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection

Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang, Feng Zhao
Point clouds and RGB images are two common perception sources in autonomous driving. The former provides accurate object localization, while the latter is denser and richer in semantic information. Recently, AutoAlign [6] presented a learnable paradigm for combining these two modalities for 3D object detection. However, it suffers from the high computational cost introduced by global-wise attention. To solve this problem, we propose the Cross-Domain DeformCAFA module in this work. It attends… 

Center Feature Fusion: Selective Multi-Sensor Fusion of Center-based Objects

This work proposes Center Feature Fusion (CFF), a novel approach in which center-based detection networks in both the camera and LiDAR streams identify relevant object locations, which are then projected and fused in the BEV frame.
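The selective fusion step described above can be sketched as follows; this is a minimal illustration, not CFF's actual implementation, and the function name, grid parameters, and per-object camera feature layout are assumptions:

```python
import numpy as np

def fuse_centers(lidar_bev, cam_feats, centers, grid_res=0.5, grid_min=-50.0):
    """Fuse camera features into a lidar BEV map only at detected object centers.

    lidar_bev: (H, W, C) BEV feature map from the lidar stream
    cam_feats: (K, C) one feature vector per camera-detected object
    centers:   (K, 2) object centers in metric (x, y) ego coordinates
    grid_res / grid_min: hypothetical BEV grid layout (meters/cell, origin)
    """
    fused = lidar_bev.copy()
    H, W, _ = lidar_bev.shape
    # Map metric center coordinates to BEV grid indices.
    ij = ((centers - grid_min) / grid_res).astype(int)
    for (i, j), feat in zip(ij, cam_feats):
        if 0 <= i < H and 0 <= j < W:
            # Selective fusion: camera features touch only object locations,
            # leaving the rest of the BEV map unchanged.
            fused[i, j] += feat
    return fused
```

Fusing only at center locations, rather than densely over the whole BEV grid, is what keeps this style of fusion cheap.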

LiteDepth: Digging into Fast and Accurate Depth Estimation on Mobile Devices

This paper develops an end-to-end learning-based model with a tiny weight size and a short inference time, and proposes a simple yet effective data augmentation strategy, called R2 crop, to boost model performance.

PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion

This paper proposes PathFusion, which introduces a path-consistency loss between shallow and deep features, encouraging the 2D backbone and its fusion path to transform 2D features in a way that is semantically aligned with the transformation of the 3D backbone.

3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection

The proposed 3D Dual-Fusion architecture fuses the features of the camera-view and 3D voxel-view domain and models their interactions through deformable attention, and redesigns the transformer fusion encoder to aggregate the information from the two domains.

M^2-3DLaneNet: Multi-Modal 3D Lane Detection

M^2-3DLaneNet is proposed, a Multi-Modal framework for effective 3D lane detection that outperforms previous state-of-the-art methods by a large margin, i.e., a 12.1% F1-score improvement on the OpenLane dataset.

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

A full suite of practical guidebook to improve the performance of BEV perception tasks, including camera, LiDAR and fusion inputs are introduced, and the future research directions in this area are pointed out.

Vision-Centric BEV Perception: A Survey

Vision-centric BEV perception has recently received increased attention from both industry and academia due to its inherent merits, including presenting a natural representation of the world and… 

AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection

This paper proposes AutoAlign, an automatic feature fusion strategy for 3D object detection that models the mapping relationship between the image and the point cloud with a learnable alignment map, and designs a self-supervised cross-modal feature interaction module that learns feature aggregation with instance-level feature guidance.

MVX-Net: Multimodal VoxelNet for 3D Object Detection

PointFusion and VoxelFusion are presented: two simple yet effective early-fusion approaches to combine the RGB and point cloud modalities, by leveraging the recently introduced VoxelNet architecture.

Multi-view 3D Object Detection Network for Autonomous Driving

This paper proposes Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LiDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. It designs a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths.

Improving Data Augmentation for Multi-Modality 3D Object Detection

  • Computer Science
  • 2021
A pipeline named transformation flow is contributed to bridge the gap between single- and multi-modality data augmentation with transformation reversing and replaying, and Multi-mOdality Cut and pAste (MoCa) is presented, which considers occlusion and physical plausibility to maintain multi-modality consistency.

Frustum PointNets for 3D Object Detection from RGB-D Data

This work directly operates on raw point clouds by popping up RGB-D scans, and leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall even for small objects.
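The core frustum-extraction idea, selecting the lidar points whose image projection falls inside a 2D detection box, can be sketched as below; the function name and the simplified 3x3 intrinsics (no extrinsic transform) are illustrative assumptions, not Frustum PointNets' code:

```python
import numpy as np

def frustum_points(points, box2d, proj):
    """Select points whose image projection lands inside a 2D detection box.

    points: (N, 3) lidar points already in the camera frame
    box2d:  (x1, y1, x2, y2) box from a mature 2D detector
    proj:   (3, 3) camera intrinsics (simplified)
    """
    uvw = points @ proj.T
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    x1, y1, x2, y2 = box2d
    # Keep points in front of the camera whose pixel falls inside the box;
    # these form the 3D frustum passed to the point-based network.
    in_frustum = (uvw[:, 2] > 0) & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return points[in_frustum]
```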

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

  • Yingwei Li, A. Yu, Mingxing Tan
  • Computer Science
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This paper proposes two novel techniques: InverseAug, which inverts geometry-related augmentations, e.g., rotation, to enable accurate geometric alignment between lidar points and image pixels, and LearnableAlign, which leverages cross-attention to dynamically capture the correlations between image and lidar features during fusion.
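The InverseAug idea, undoing a geometric augmentation before projecting points into the image, can be sketched as follows; the function name, the yaw-only augmentation, and the 3x3-intrinsics simplification are assumptions for illustration:

```python
import numpy as np

def inverse_aug_project(aug_points, yaw, proj):
    """Undo a z-axis rotation augmentation, then project points to pixels.

    aug_points: (N, 3) points after a yaw-rotation augmentation
    yaw:        the augmentation angle (radians) that was applied
    proj:       (3, 3) camera intrinsics (simplified; no extrinsics here)
    """
    c, s = np.cos(-yaw), np.sin(-yaw)
    # Rotating by -yaw recovers the original, un-augmented coordinates,
    # so the projection lines up with the (un-augmented) camera image.
    inv_rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    original = aug_points @ inv_rot.T
    uvw = original @ proj.T
    return uvw[:, :2] / uvw[:, 2:3]
```

Without this inversion, an augmented point cloud would project to the wrong pixels, breaking the lidar-image correspondence that fusion relies on.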

PointPainting: Sequential Fusion for 3D Object Detection

PointPainting is proposed, a sequential fusion method that projects lidar points into the output of an image-only semantic segmentation network and appends the class scores to each point; the paper also shows how latency can be minimized through pipelining.
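The "painting" step can be sketched as a projection plus a gather; this is a minimal illustration under assumed conventions (points already in the camera frame, a bare 3x3 intrinsic matrix), not the paper's code:

```python
import numpy as np

def paint_points(points, seg_scores, proj):
    """Append per-pixel semantic class scores to each lidar point.

    points:     (N, 3) lidar points in the camera frame
    seg_scores: (H, W, C) softmax output of an image segmentation network
    proj:       (3, 3) camera intrinsic matrix
    Returns (M, 3 + C) "painted" points for the M points visible in the image.
    """
    H, W, C = seg_scores.shape
    # Project each point into the image plane.
    uvw = points @ proj.T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    # Keep only points in front of the camera that land inside the image.
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Concatenate each point with the class scores of the pixel it hits.
    return np.concatenate([points[valid], seg_scores[v[valid], u[valid]]], axis=1)
```

The painted points then feed any lidar-only detector unchanged, which is what makes the fusion "sequential".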

STD: Sparse-to-Dense 3D Object Detector for Point Cloud

This work proposes a two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD), and implements a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance.

End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds

This paper aims to synergize the bird's-eye view and the perspective view, and proposes a novel end-to-end multi-view fusion (MVF) algorithm that can effectively learn to utilize the complementary information from both, significantly improving detection accuracy over the comparable single-view PointPillars baseline.

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes a new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost.