MonStereo: When Monocular and Stereo Meet at the Tail of 3D Human Localization

Lorenzo Bertoni, Sven Kreiss, Taylor Mordan, and Alexandre Alahi. In 2021 IEEE International Conference on Robotics and Automation (ICRA).
Monocular and stereo vision are cost-effective solutions for 3D human localization in the context of self-driving cars or social robots. However, they are usually developed independently and have their respective strengths and limitations. We propose a novel unified learning framework that leverages the strengths of both monocular and stereo cues for 3D human localization. Our method jointly (i) associates humans in left-right images, (ii) deals with occluded and distant cases in stereo…
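The abstract's emphasis on "distant cases in stereo" follows from the classical pinhole-stereo geometry: depth is inversely proportional to disparity, so far-away people produce tiny disparities where pixel-level matching noise dominates. A minimal sketch of that relation (illustrative helper, not the authors' code; `focal_px` and `baseline_m` are hypothetical parameter names):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Classical pinhole-stereo relation: z = f * b / d.

    Small disparities (distant people) make z highly sensitive to
    matching noise, which is why purely stereo pipelines degrade at
    the tail of the distance distribution and monocular cues help.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

For example, with a KITTI-like focal length of 720 px and a 0.54 m baseline, a matching error of a single pixel at 7 px of disparity shifts the estimated depth by several meters.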
Are socially-aware trajectory prediction models really socially-aware?
This paper introduces a socially-attended attack to assess the social understanding of prediction models in terms of collision avoidance, and proposes hard and soft-attention mechanisms to guide the attack.
Perceiving Humans: from Monocular 3D Localization to Social Distancing
This work presents a new cost-effective vision-based method that perceives humans' locations in 3D and their body orientation from a single image, and shows that the concept of "social distancing" can be rethought as a form of social interaction, in contrast to a simple location-based rule.


MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation
The architecture is a lightweight feed-forward neural network that predicts 3D locations and corresponding confidence intervals from 2D human poses, making it particularly well suited for small training data, cross-dataset generalization, and real-time applications.
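The confidence intervals mentioned above come from modeling localization error with an aleatoric likelihood: the network outputs both a distance and a spread that widens on ambiguous inputs. A minimal sketch of a Laplace negative log-likelihood loss in that spirit (variable names are hypothetical, not the authors' code):

```python
import math

def laplace_nll(d_true, d_pred, log_b_pred):
    """Negative log-likelihood of a Laplace distribution.

    Minimizing this over training pairs encourages the network to
    predict a distance d_pred together with a spread b = exp(log_b_pred);
    b grows on ambiguous inputs, yielding per-sample confidence intervals.
    Parameterizing the spread via its log keeps b strictly positive.
    """
    b = math.exp(log_b_pred)
    return abs(d_true - d_pred) / b + log_b_pred
```

Note that the loss is zero only when the prediction is exact and the spread is 1; an overconfident (small-b) wrong prediction is penalized heavily by the first term.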
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
This work proposes a novel instance depth estimation (IDE) method that directly predicts the depth of the targeted 3D bounding box's center using sparse supervision, and demonstrates that MonoGRNet achieves state-of-the-art performance on challenging datasets.
Joint Human Pose Estimation and Stereo 3D Localization
An end-to-end trainable neural network architecture for stereo imaging is presented that jointly localizes humans and estimates their body poses in 3D, and is particularly suitable for autonomous vehicles.
PifPaf: Composite Fields for Human Pose Estimation
The new PifPaf method, which uses a Part Intensity Field to localize body parts and a Part Association Field to associate body parts with each other to form full human poses, outperforms previous methods at low resolution and in crowded, cluttered and occluded scenes.
Triangulation Learning Network: From Monocular to Stereo 3D Object Detection
This paper proposes to employ 3D anchors to explicitly construct object-level correspondences between the regions of interest in stereo images, from which the deep neural network learns to detect and triangulate the targeted object in 3D space.
Stereo R-CNN Based 3D Object Detection for Autonomous Driving
Experiments show that the proposed Stereo R-CNN method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks.
Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
MonoPSR, a monocular 3D object detection method that leverages accurate proposals and shape reconstruction, is presented, and a novel projection alignment loss is devised to jointly optimize these tasks in the neural network to improve 3D localization accuracy.
Orthographic Feature Transform for Monocular 3D Object Detection
The orthographic feature transform is introduced, which enables us to escape the image domain by mapping image-based features into an orthographic 3D space and allows us to reason holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
Disentangling Monocular 3D Object Detection
An approach for monocular 3D object detection from a single RGB image, which leverages a novel disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes is proposed.
Monocular 3D Object Detection for Autonomous Driving
This work proposes an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
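The ground-plane constraint exploited above has a simple closed form for a level pinhole camera: a point lying on flat ground and projecting to image row v has depth z = f * H / (v - v0), where H is the camera height and v0 the principal-point row (the horizon). A hedged sketch of this prior (illustrative helper, with hypothetical parameter names):

```python
def ground_plane_depth(focal_px, cam_height_m, v_px, v0_px):
    """Depth of a ground-plane point from its image row.

    For a pinhole camera mounted level at height H over flat ground,
    a ground point at image row v satisfies z = f * H / (v - v0).
    Object candidates placed this way need only a 2D footprint row
    to get a 3D depth hypothesis, which is the kind of ground-plane
    prior such energy-minimization methods exploit.
    """
    if v_px <= v0_px:
        raise ValueError("a ground point must project below the horizon")
    return focal_px * cam_height_m / (v_px - v0_px)
```

Rows closer to the horizon map to larger depths, so small errors in v translate into large depth errors for distant objects, mirroring the stereo small-disparity problem.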