Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

Authors: Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Enze Xie, Zhiqi Li, Hanming Deng, Haonan Tian, Xizhou Zhu, Li Chen, Yulu Gao, Xiangwei Geng, Jianqiang Zeng, Yang Li, Jiazhi Yang, Xiaosong Jia, Bo Yu, Y. Qiao, Dahua Lin, Siqian Liu, Junchi Yan, Jianping Shi, and Ping Luo
Learning powerful representations in bird's-eye view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations grow more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important. BEV perception inherits…
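To make the unified-view idea concrete, here is a minimal sketch (with assumed grid ranges and resolution, not taken from any of the papers below) of rasterizing ego-frame 3D points into a BEV occupancy grid; features from any sensor expressed in the same ego frame can be pooled into such a grid in the same way:

```python
import numpy as np

def points_to_bev_grid(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                       resolution=0.5):
    """Rasterize ego-frame 3D points of shape (N, 3) into a BEV occupancy grid.

    Cells covered by at least one point are marked occupied.
    """
    x, y = points[:, 0], points[:, 1]
    # Keep only points inside the BEV extent.
    mask = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y = x[mask], y[mask]
    # Map metric coordinates to integer cell indices.
    col = ((x - x_range[0]) / resolution).astype(int)
    row = ((y - y_range[0]) / resolution).astype(int)
    n_cols = int((x_range[1] - x_range[0]) / resolution)
    n_rows = int((y_range[1] - y_range[0]) / resolution)
    grid = np.zeros((n_rows, n_cols), dtype=np.uint8)
    grid[row, col] = 1
    return grid

pts = np.array([[0.0, 0.0, 0.2], [10.0, -5.0, 0.5], [999.0, 0.0, 0.0]])
bev = points_to_bev_grid(pts)
print(bev.shape, bev.sum())  # (200, 200) 2  -- the far point is cropped out
```

Real BEV networks replace the binary occupancy with learned per-cell feature vectors, but the coordinate bookkeeping is the same.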

Geometric-aware Pretraining for Vision-centric 3D Object Detection

This work proposes a novel geometric-aware pretraining framework called GAPretrain, which imparts spatial and structural cues to camera networks by using the geometry-rich modality as guidance during pretraining, and serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.

Occ-BEV: Multi-Camera Unified Pre-training via 3D Scene Reconstruction

A novel multi-camera unified pre-training framework called Occ-BEV, which first reconstructs the 3D scene as a foundational stage and then fine-tunes the model on downstream tasks, demonstrates promising results on key tasks such as multi-camera 3D object detection and semantic scene completion.

DeepSTEP - Deep Learning-Based Spatio-Temporal End-To-End Perception for Autonomous Vehicles

This concept of an end-to-end perception architecture combines detection and localization into a single pipeline, allowing efficient processing that reduces computational overhead and further improves overall performance, making it a promising solution for real-world deployment.

Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

The proposed refinement module can be stacked in a cascaded fashion, which extends the capacity of the decoder with spatial-temporal prior knowledge about the conditioned future, and achieves state-of-the-art performance on closed-loop benchmarks.

Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving

A Bi-Mapper framework for top-down road-scene semantic understanding is proposed, which incorporates a global view and local prior knowledge, together with an asynchronous mutual learning strategy to enhance reliable interaction between them.

Fusion is Not Enough: Single-Modal Attacks to Compromise Fusion Models in Autonomous Driving

It is argued that the weakest link of a fusion model is its most vulnerable modality, and an attack framework targeting advanced camera-LiDAR fusion models with adversarial patches is proposed; its effectiveness and practicality are demonstrated.

Road Genome: A Topology Reasoning Benchmark for Scene Understanding in Autonomous Driving

The goal of Road Genome is to understand scene structure by reasoning about the relationships of perceived entities among traffic elements and lanes; to this end, OpenLane-V2, a newly minted benchmark, is introduced.

Sparse Dense Fusion for 3D Object Detection

Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture, is proposed: a simple yet effective sparse-dense fusion structure that enriches semantic texture and exploits spatial structural information simultaneously.

Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving

This paper evaluates the natural and adversarial robustness of various representative models under extensive settings to fully understand how their behaviors are influenced by explicit BEV features, compared with models without BEV, and proposes a 3D-consistent patch attack that applies adversarial patches in 3D space to guarantee spatiotemporal consistency.

Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving

This paper designs 27 types of common corruptions for both LiDAR and camera inputs, considering real-world driving scenarios, and conducts large-scale experiments on 24 diverse 3D object detection models to evaluate their corruption robustness, yielding several important findings.
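As a hedged illustration of what one such LiDAR corruption might look like (the function name, noise scale, and severity mapping below are assumptions for the sketch, not the paper's exact protocol):

```python
import numpy as np

def jitter_points(points, severity=1, rng=None):
    """Corrupt a LiDAR point cloud of shape (N, 3) with Gaussian coordinate
    noise whose standard deviation grows with an integer severity level 1-5.
    Illustrative only; benchmark protocols fix their own noise scales.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = 0.02 * severity  # assumed metres of jitter per severity level
    return points + rng.normal(0.0, sigma, size=points.shape)

pts = np.zeros((100, 3))       # a dummy point cloud at the origin
noisy = jitter_points(pts, severity=3)
print(noisy.shape)  # (100, 3)
```

A benchmark then re-runs each detector on the corrupted inputs at every severity level and reports the degradation relative to clean data.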

Scalability in Perception for Autonomous Driving: Waymo Open Dataset

This work introduces a new large-scale, high-quality, diverse dataset, consisting of well-synchronized and calibrated LiDAR and camera data captured across a range of urban and suburban geographies, and studies the effects of dataset size and cross-geography generalization on 3D detection methods.

nuScenes: A Multimodal Dataset for Autonomous Driving

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image-based benchmark datasets have driven development in computer vision tasks such as object detection and tracking.

Are we ready for autonomous driving? The KITTI vision benchmark suite

The autonomous driving platform is used to develop novel, challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM, and 3D object detection, revealing that methods ranking high on established datasets such as Middlebury perform below average when moved outside the laboratory into the real world.

BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework

This work proposes a surprisingly simple yet novel fusion framework, dubbed BEVFusion, whose camera stream does not depend on LiDAR input, thus addressing the downside of previous methods; it is the first to handle realistic LiDAR malfunction and can be deployed in realistic scenarios without any post-processing procedure.

Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

Argoverse 2 (AV2) is introduced: a collection of three datasets for perception and forecasting research in the self-driving domain that supports self-supervised learning and the emerging task of point cloud forecasting.

Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

In pursuit of the goal of learning dense representations for motion planning, it is shown that the representations inferred by the model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by the network.
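The "lift" step behind this family of models can be sketched as an outer product between per-pixel image features and a predicted categorical distribution over discrete depth bins; the shapes and random inputs below are illustrative, not the paper's configuration:

```python
import numpy as np

# Assumed toy shapes: a feature map with C channels over H x W pixels,
# and logits over D candidate depth bins per pixel.
H, W, C, D = 4, 6, 8, 10
rng = np.random.default_rng(0)
features = rng.standard_normal((H, W, C))
depth_logits = rng.standard_normal((H, W, D))

# Softmax over depth bins: each pixel hedges across possible depths.
depth_prob = np.exp(depth_logits)
depth_prob /= depth_prob.sum(axis=-1, keepdims=True)

# The lift: the outer product places a probability-scaled copy of each
# pixel's feature at every candidate depth, giving a frustum (H, W, D, C).
frustum = depth_prob[..., :, None] * features[..., None, :]
print(frustum.shape)  # (4, 6, 10, 8)
```

The subsequent "splat" step then pools these frustum features into BEV cells using the camera calibration, much like the grid rasterization shown earlier.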

Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution

This work proposes Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch, and presents 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively.

One Thousand and One Hours: Self-driving Motion Prediction Dataset

This dataset was collected by a fleet of 20 autonomous vehicles along a fixed route in Palo Alto, California over a four-month period, and forms the largest, most complete, and most detailed dataset to date for the development of self-driving machine-learning tasks such as motion forecasting, planning, and simulation.

Inverse perspective mapping simplifies optical flow computation and obstacle detection

It turns out that besides obstacle detection, inverse perspective mapping has additional advantages for regularizing optical flow algorithms.
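For points on a flat ground plane, inverse perspective mapping reduces to a planar homography between image and ground coordinates. A minimal sketch with a toy camera configuration (the intrinsics, height, and pose below are illustrative, chosen so that a ground point (X, Y) has camera coordinates (X, Y, h)):

```python
import numpy as np

# Toy pinhole camera at height h above the ground plane z = 0.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
h = 10.0
R = np.eye(3)                 # toy pose: ground point (X, Y, 0) -> (X, Y, h)
t = np.array([0.0, 0.0, h])

# For points on z = 0 the projection collapses to a 3x3 homography:
#   pixel ~ K [r1 r2 t] [X, Y, 1]^T
H = K @ np.column_stack((R[:, 0], R[:, 1], t))

def image_to_ground(u, v):
    """Back-project a pixel to ground-plane coordinates via H^-1."""
    p = np.linalg.inv(H) @ np.array([u, v, 1.0])
    return p[:2] / p[2]

# The principal point maps to the ground point directly below the camera.
print(image_to_ground(320.0, 240.0))  # ≈ [0. 0.]
```

Warping a whole image through this homography yields the familiar bird's-eye "road texture" view; on that remapped image, translational optical flow of ground points becomes uniform, which is exactly the regularization advantage noted above.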