Learning to Evaluate Perception Models Using Planner-Centric Metrics

@article{Philion2020LearningTE,
  title={Learning to Evaluate Perception Models Using Planner-Centric Metrics},
  author={Jonah Philion and Amlan Kar and Sanja Fidler},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={14052-14061}
}
Variants of accuracy and precision are the gold standard by which the computer vision community measures progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; in general, we seek to detect zero false negatives or positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and at best, heuristics need to be chosen to ensure that different…
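The metric proposed here scores a detector by how much its mistakes perturb a downstream planner rather than by geometric overlap alone. As a rough sketch of the idea, assuming the planner exposes discretized distributions over future ego positions conditioned on ground-truth versus predicted detections (the function name, array shapes, and renormalization below are illustrative assumptions, not the paper's implementation):

import numpy as np

def planner_kl(p_gt, p_pred, eps=1e-12):
    """Planner-centric KL sketch: sum over the planning horizon of
    KL(p_gt || p_pred), where each row is a discretized distribution
    over future ego positions conditioned on ground-truth vs.
    predicted detections. Larger values mean the detector's mistakes
    changed the plan more."""
    p_gt = np.clip(p_gt, eps, None)
    p_pred = np.clip(p_pred, eps, None)
    p_gt = p_gt / p_gt.sum(axis=-1, keepdims=True)        # renormalize rows
    p_pred = p_pred / p_pred.sum(axis=-1, keepdims=True)
    return float(np.sum(p_gt * (np.log(p_gt) - np.log(p_pred))))

Under this kind of score, a perfect detector yields zero; a hallucinated obstacle on the route inflates it, while an equally "wrong" box far from the plan barely moves it.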
The efficacy of Neural Planning Metrics: A meta-analysis of PKL on nuScenes
TLDR
A neural planning metric based on the KL divergence between a planner's trajectory and the ground-truth route is used to score all submissions to the nuScenes detection challenge; it is found that, while somewhat correlated with mAP, the PKL metric behaves differently under increased traffic density, ego velocity, road curvature, and intersections.
From Evaluation to Verification: Towards Task-oriented Relevance Metrics for Pedestrian Detection in Safety-critical Domains
TLDR
This work considers pedestrian detection as a highly relevant perception task and argues that standard measures such as Intersection over Union (IoU) are insufficient, mainly because they are insensitive to important physical cues including distance, speed, and direction of motion.
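To make the criticism concrete: plain IoU returns the same score whether a pedestrian stands 3 m or 50 m from the vehicle. The sketch below contrasts IoU with a purely hypothetical distance- and speed-aware relevance weight; the functional form and the parameters alpha and beta are illustrative assumptions, not the paper's proposed metric:

def iou(a, b):
    """Axis-aligned 2D IoU for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def relevance_weight(distance_m, closing_speed_mps, alpha=10.0, beta=2.0):
    """Hypothetical task-aware weight (illustrative only): nearby
    pedestrians that close in quickly weigh more than distant,
    receding ones. alpha and beta set the distance and speed scales."""
    proximity = 1.0 / (1.0 + distance_m / alpha)
    urgency = 1.0 + max(0.0, closing_speed_mps) / beta
    return proximity * urgency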
LiDAR Cluster First and Camera Inference Later: A New Perspective Towards Autonomous Driving
TLDR
This paper presents a new end-to-end pipeline for AVs that introduces the concept of "LiDAR cluster first, camera inference later" to detect and classify objects, and shows that this pipeline prioritizes the detection of higher-risk objects while achieving comparable accuracy and a 25% higher average speed than camera-only inference.
A Step Towards Efficient Evaluation of Complex Perception Tasks in Simulation
TLDR
This work proposes an approach that enables efficient large-scale testing using simplified low-fidelity simulators, without the computational cost of executing expensive deep learning models, by designing an efficient surrogate model for the compute-intensive components of the task under test.
M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation
TLDR
M2BEV is memory-efficient, allowing significantly higher-resolution images as input with faster inference speed, and achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU on these two tasks.
Deep Multi-Task Learning for Joint Localization, Perception, and Prediction
TLDR
A system that jointly performs perception, prediction, and localization is designed, which reuses computation between the three tasks and is thus able to correct localization errors efficiently.
3D Object Detection for Autonomous Driving: A Review and New Outlooks
TLDR
This paper conducts a comprehensive survey of the progress in 3D object detection from the aspects of models and sensory inputs, including LiDAR-based, camera-based, and multi-modal detection approaches, and provides an in-depth analysis of the potentials and challenges in each category of methods.
Injecting Planning-Awareness into Prediction and Detection Evaluation
TLDR
Experiments on an illustrative simulation as well as real-world autonomous driving data validate that the proposed task-aware metrics are able to account for outcome asymmetry and provide a better estimate of a model’s closed-loop performance.
Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
TLDR
In pursuit of the goal of learning dense representations for motion planning, it is shown that the representations inferred by the model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by the network.
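The "shooting" step amounts to evaluating a fixed bank of template trajectories against the predicted bird's-eye-view cost map and keeping the cheapest. A minimal sketch, where the 0.5 m cell size and the ego origin cell are assumptions rather than the paper's exact grid conventions:

import numpy as np

def shoot_templates(cost_map, templates, resolution=0.5, origin=(100, 100)):
    """Score a bank of template trajectories against a BEV cost map by
    summing the cost under each trajectory's cells, then return the
    index of the cheapest template and all scores.
    cost_map: (H, W) costs; templates: (K, T, 2) ego-frame xy in meters."""
    cells = np.round(templates / resolution).astype(int) + np.asarray(origin)
    rows = np.clip(cells[..., 1], 0, cost_map.shape[0] - 1)
    cols = np.clip(cells[..., 0], 0, cost_map.shape[1] - 1)
    scores = cost_map[rows, cols].sum(axis=1)   # (K,) total cost per template
    return int(np.argmin(scores)), scores

Because the templates are fixed, the planner stays interpretable: every candidate plan and its cost under the network's map can be inspected directly.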
Quantity over Quality: Training an AV Motion Planner with Large Scale Commodity Vision Data
TLDR
This work shows that it is possible to train a high-performance motion planner using commodity vision data, outperforming planners trained on HD-sensor data at a fraction of the cost, and is the first to demonstrate that this is possible using real-world data.
…

References

Showing 1–10 of 39 references
Monocular 3D Object Detection for Autonomous Driving
TLDR
This work proposes an energy-minimization approach that places object candidates in 3D using the fact that objects should be on the ground plane, and achieves the best detection performance on the challenging KITTI benchmark among published monocular competitors.
End to End Learning for Self-Driving Cars
TLDR
A convolutional neural network is trained to map raw pixels from a single front-facing camera directly to steering commands, and it is argued that this will eventually lead to better performance and smaller systems.
PointPillars: Fast Encoders for Object Detection From Point Clouds
TLDR
Benchmark results suggest that PointPillars is an appropriate encoding for object detection in point clouds, and a lean downstream network is also proposed.
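For reference, the heart of the pillar encoding is simply scattering points into vertical columns on an xy grid; a minimal sketch (omitting PointPillars' per-point feature augmentation and learned PointNet layer, with cell size and budgets as assumed defaults) might look like:

import numpy as np

def pillarize(points, cell=0.16, max_pillars=12000, max_pts=32):
    """Scatter LiDAR points into vertical pillars on an xy grid.
    points: (N, 4) array of x, y, z, intensity."""
    ix = np.floor(points[:, 0] / cell).astype(int)
    iy = np.floor(points[:, 1] / cell).astype(int)
    pillars = {}
    for point, key in zip(points, zip(ix, iy)):
        if key not in pillars and len(pillars) >= max_pillars:
            continue                       # pillar budget exhausted
        bucket = pillars.setdefault(key, [])
        if len(bucket) < max_pts:          # cap points per pillar
            bucket.append(point)
    return pillars                         # (ix, iy) -> list of points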
Disentangling Monocular 3D Object Detection
TLDR
An approach for monocular 3D object detection from a single RGB image, which leverages a novel disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes is proposed.
nuScenes: A Multimodal Dataset for Autonomous Driving
Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image-based benchmark datasets have driven development in computer vision tasks such as object…
PIXOR: Real-time 3D Object Detection from Point Clouds
TLDR
PIXOR is proposed: a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions, surpassing other state-of-the-art methods notably in terms of Average Precision (AP) while still running at 10 FPS.
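The proposal-free decoding amounts to reading one oriented box off every confident pixel of the dense output maps. A sketch, where the channel layout and the 0.1 m cell size are assumptions rather than PIXOR's published head:

import numpy as np

def decode_pixor(pred, scores, thresh=0.5, resolution=0.1):
    """Decode oriented BEV boxes from dense per-pixel regression;
    non-maximum suppression is omitted. Assumed channel layout:
    (cos, sin, dx, dy, log_w, log_l).
    pred: (6, H, W) regression maps; scores: (H, W) objectness."""
    ys, xs = np.nonzero(scores > thresh)
    cos_t, sin_t, dx, dy, log_w, log_l = pred[:, ys, xs]
    cx = xs * resolution + dx              # refine cell centers by offsets
    cy = ys * resolution + dy
    theta = np.arctan2(sin_t, cos_t)
    return np.stack([cx, cy, np.exp(log_w), np.exp(log_l), theta], axis=1)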
SECOND: Sparsely Embedded Convolutional Detection
TLDR
An improved sparse convolution method for Voxel-based 3D convolutional networks is investigated, which significantly increases the speed of both training and inference and introduces a new form of angle loss regression to improve the orientation estimation performance.
Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection
TLDR
This report presents the method which wins the nuScenes 3D Detection Challenge, and proposes a balanced grouping head to boost the performance for the categories with similar shapes, achieving state-of-the-art detection performance on the nuScenes dataset.
STD: Sparse-to-Dense 3D Object Detector for Point Cloud
TLDR
This work proposes a two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD), and implements a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance.
Bag of Freebies for Training Object Detection Neural Networks
TLDR
This work explores training tweaks that apply to various models, including Faster R-CNN and YOLOv3, and that can improve precision by up to 5% absolute compared to state-of-the-art baselines.
…