NEAT: Neural Attention Fields for End-to-End Autonomous Driving

Kashyap Chitta, Aditya Prakash, Andreas Geiger. NEAT: Neural Attention Fields for End-to-End Autonomous Driving. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial prerequisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end imitation learning models. NEAT is a continuous function which maps locations in Bird’s Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact… 
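The abstract describes NEAT as a continuous function that maps a BEV query location (plus time) to waypoints and semantics, using iterative attention to compress image features into a compact context. The following is a minimal toy sketch of that idea with random weights and made-up dimensions (`neat_field`, `W_query`, etc. are illustrative names, not the paper's implementation; the real model learns these weights end-to-end by imitation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
FEAT_DIM, N_PATCHES, N_CLASSES = 16, 64, 5

# Frozen random "image features" stand in for encoder outputs over image patches.
patch_feats = rng.standard_normal((N_PATCHES, FEAT_DIM))

# Toy attention-field parameters (in NEAT these are trained MLP weights).
W_query = rng.standard_normal((3, FEAT_DIM)) * 0.1   # maps (x, y, t) -> query
W_sem   = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.1
W_wp    = rng.standard_normal((FEAT_DIM, 2)) * 0.1   # 2D waypoint-offset head

def neat_field(x, y, t, n_iters=2):
    """Query the attention field at BEV location (x, y) and time t.

    Each iteration forms softmax attention over the image patches,
    compresses them into one context vector, and refines the query,
    loosely mirroring NEAT's iterative attention.
    """
    q = np.array([x, y, t]) @ W_query
    for _ in range(n_iters):
        scores = patch_feats @ q                      # (N_PATCHES,)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                            # softmax attention map
        context = attn @ patch_feats                  # compressed context vector
        q = q + context                               # iterative query refinement
    semantics = q @ W_sem                             # per-class logits at (x, y)
    waypoint_offset = q @ W_wp                        # predicted 2D offset
    return semantics, waypoint_offset, attn

sem, wp, attn = neat_field(1.0, -2.0, 0.5)
print(sem.shape, wp.shape)
```

Because the output is a function of continuous coordinates, the field can be queried at any BEV location and resolution, which is what distinguishes this representation from a fixed-grid BEV map.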

NMR: Neural Manifold Representation for Autonomous Driving

This work proposes Neural Manifold Representation (NMR), a representation for autonomous driving that learns to infer semantics and predict waypoints on a manifold over a horizon centered on the ego-vehicle, using iterative attention applied to a latent high-dimensional embedding of surround monocular images and a partial ego-vehicle state.

ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

This paper proposes ST-P3, a spatial-temporal feature learning scheme that yields more representative features for perception, prediction, and planning simultaneously, and is the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.

MMFN: Multi-Modal-Fusion-Net for End-to-End Driving

This work proposes a novel approach to extract features from vectorized High-Definition (HD) maps and utilize them in the end-to-end driving tasks and designs a new expert to further enhance the model performance by considering multi-road rules.

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

BEVFormer, a new framework that learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks, achieves a new state of the art of 56.9% NDS on the nuScenes test set.

DeepIPC: Deeply Integrated Perception and Control for Mobile Robot in Real Environments

DeepIPC, an end-to-end multi-task model that handles both perception and control for driving a mobile robot autonomously, achieves the best drivability and multi-task performance even with fewer parameters than the other models.

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

This work proposes TransFuser, a mechanism to integrate image and LiDAR representations using self-attention, which outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin.
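TransFuser's core mechanism, fusing image and LiDAR representations with attention, can be sketched roughly as follows: concatenate tokens from both modalities and run joint self-attention over them so each branch can borrow context from the other. This is a single-head toy version with random weights (`self_attention_fuse` and all dimensions are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy token dimension

def self_attention_fuse(img_tokens, lidar_tokens):
    """Jointly self-attend over concatenated image and LiDAR tokens,
    then split the fused sequence back into the two branches."""
    x = np.concatenate([img_tokens, lidar_tokens], axis=0)   # (N_img + N_lidar, D)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)                            # scaled dot-product
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                 # row-wise softmax
    fused = x + attn @ v                                     # residual fusion
    return fused[: len(img_tokens)], fused[len(img_tokens):]

img_tokens = rng.standard_normal((4, D))
lidar_tokens = rng.standard_normal((3, D))
fused_img, fused_lidar = self_attention_fuse(img_tokens, lidar_tokens)
print(fused_img.shape, fused_lidar.shape)
```

Because every token attends to every other token across both modalities, this kind of fusion captures the global context that the summary above credits for the leaderboard gains.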

Neural Fields in Visual Computing and Beyond

A review of the literature on neural fields shows the breadth of topics already covered in visual computing, both historically and in current incarnations, and highlights the improved quality, flexibility, and capability brought by neural field methods.

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

A full suite of practical guidelines for improving the performance of BEV perception tasks, covering camera, LiDAR, and fusion inputs, is introduced, and future research directions in this area are pointed out.

GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation

A novel two-stage pipeline transforms the perspective view into the bird's-eye view by performing geometry-guided pre-alignment and then further enhancing the BEV features with ray-based transformers; it can also be easily adapted to multi-view scenarios to build a full-scene BEV map.

SAM: Squeeze-and-Mimic Networks for Conditional Visual Driving Policy Learning

We describe a policy learning approach to map visual inputs to driving controls, conditioned on a turning command, that leverages side tasks on semantics and object affordances via a learned…

Conditional Affordance Learning for Driving in Urban Environments

This work proposes a direct perception approach which maps video input to intermediate representations suitable for autonomous navigation in complex urban environments given high-level directional inputs, and is the first to handle traffic lights and speed signs by using image-level labels only.

Driving Policy Transfer via Modularity and Abstraction

This work presents an approach to transferring driving policies from simulation to reality via modularity and abstraction, inspired by classic driving systems and aims to combine the benefits of modular architectures and end-to-end deep learning approaches.

Action-Based Representation Learning for Autonomous Driving

This work shows that an affordance-based driving model pre-trained with this approach can leverage a relatively small amount of weakly annotated imagery and outperform pure end-to-end driving models, while being more interpretable.

Label Efficient Visual Abstractions for Autonomous Driving

This work seeks to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents, and finds that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost.

End-To-End Interpretable Neural Motion Planner

A holistic model is designed that takes raw LiDAR data and an HD map as input and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon.

End-to-End Driving Via Conditional Imitation Learning

This work evaluates different architectures for conditional imitation learning in vision-based driving and conducts experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area.

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

This work demonstrates that imitation learning policies based on existing sensor fusion methods underperform in the presence of a high density of dynamic agents and complex scenarios that require global contextual reasoning, and proposes TransFuser, a novel Multi-Modal Fusion Transformer that integrates image and LiDAR representations using attention.

Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving

An end-to-end self-driving network featuring a sparse attention module that learns to automatically attend to important regions of the input is proposed, which significantly improves the planner safety by performing more focused computation.

End to End Learning for Self-Driving Cars

A convolutional neural network is trained to map raw pixels from a single front-facing camera directly to steering commands and it is argued that this will eventually lead to better performance and smaller systems.
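The end-to-end idea summarized above, raw pixels in, steering command out, with no hand-designed intermediate representation, can be reduced to a one-line sketch. Here a single random linear layer stands in for the trained convolutional network (all names and dimensions are illustrative assumptions, not NVIDIA's system):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for a camera frame; the real system uses a trained CNN,
# replaced here by one random linear map for illustration.
H, W = 6, 8
frame = rng.random((H, W))

weights = rng.standard_normal(H * W) * 0.01  # "learned" weights, random here

def steering_command(img):
    """Map raw pixels directly to a single bounded steering value."""
    return float(np.tanh(img.ravel() @ weights))

angle = steering_command(frame)
print(angle)
```

Everything between pixels and actuation is absorbed into the learned mapping, which is exactly the property the paper argues leads to smaller systems, and also what the intermediate-representation approaches earlier in this list push back against.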