Seeing Behind Objects for 3D Multi-Object Tracking in RGB-D Sequences

@inproceedings{Mller2021SeeingBO,
  title={Seeing Behind Objects for 3D Multi-Object Tracking in RGB-D Sequences},
  author={Norman M{\"u}ller and Yu-Shiang Wong and Niloy Jyoti Mitra and Angela Dai and Matthias Nie{\ss}ner},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={6067-6076}
}
Multi-object tracking from RGB-D video sequences is a challenging problem due to the combination of changing viewpoints, motion, and occlusions over time. Our key insight is that inferring the complete geometry of rigidly moving objects significantly aids their tracking, and we thus propose to jointly infer the complete geometry of objects while tracking them over time. By hallucinating unseen regions…

ObjectFusion: Accurate Object-level SLAM with Neural Object Priors
TLDR
ObjectFusion is presented as a novel object-level SLAM system that efficiently creates an object-oriented 3D map with high-quality object reconstruction by leveraging neural object priors, converting the neural object representation into precise measurements that jointly optimize object shape, object pose, and camera pose for accurate 3D object reconstruction.
CHORE: Contact, Human and Object REconstruction from a single RGB image
TLDR
This work introduces CHORE, a novel method that learns to jointly reconstruct human and object from a single image, significantly outperforming the state of the art, and proposes a simple yet effective depth-aware scaling that allows more efficient shape learning on real data.
BEHAVE: Dataset and Method for Tracking Human Object Interactions
TLDR
The key insight is to predict correspondences from both the human and the object to a statistical body model, yielding human-object contacts during interactions; this enables learning a model that jointly tracks humans and objects in natural environments with an easy-to-use, portable multi-camera setup.
Seg2Pose: Pose Estimations from Instance Segmentation Masks in One or Multiple Views for Traffic Applications
TLDR
A system is presented which converts pixel coordinate tracks, represented by instance segmentation masks across multiple video frames, into world coordinate pose tracks, for road users seen by static surveillance cameras, by using a late fusion scheme.
AutoRF: Learning 3D Object Radiance Fields from Single View Observations
TLDR
This work proposes to learn a normalized, object-centric representation whose embedding describes and disentangles shape, appearance, and pose, and improves reconstruction quality by optimizing the shape and appearance codes at test time to tightly fit the input image.
3D Semantic Scene Completion: a Survey
TLDR
This survey aims to identify, compare, and analyze the techniques in the semantic scene completion (SSC) literature, providing a critical analysis of both methods and datasets, along with an in-depth analysis of the existing works covering the choices made by their authors, while highlighting the remaining avenues of research.

References

Showing 1–10 of 40 references
MID-Fusion: Octree-based Object-Level Multi-Instance Dynamic SLAM
TLDR
This is the first system to generate an object-level dynamic volumetric map from a single RGB-D camera, which can be used directly for robotic tasks; its effectiveness is demonstrated through quantitative and qualitative testing on both synthetic and real-world sequences.
Co-fusion: Real-time segmentation, tracking and fusion of multiple objects
  • Martin Rünz, L. Agapito
  • Computer Science
    2017 IEEE International Conference on Robotics and Automation (ICRA)
  • 2017
TLDR
Co-Fusion is a dense SLAM system that takes a live stream of RGB-D images as input and segments the scene into different objects while simultaneously tracking and reconstructing their 3D shapes in real time. It can enable a robot to maintain a scene description at the object level, with the potential to allow interaction with its working environment, even in dynamic scenes.
RevealNet: Seeing Behind Objects in RGB-D Scans
TLDR
RevealNet is a new data-driven approach that jointly detects object instances and predicts their complete geometry, which enables a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, including hidden and unobserved object parts.
RigidFusion: RGB‐D Scene Reconstruction with Rigidly‐moving Objects
TLDR
RigidFusion is a novel asynchronous moving-object detection method, combined with a modified volumetric fusion, that handles significantly more challenging reconstruction scenarios involving a moving camera and improves moving-object detection.
MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects
  • Martin Rünz, L. Agapito
  • Computer Science
    2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
  • 2018
TLDR
This work presents MaskFusion, a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems that output a purely geometric map of a static scene, taking full advantage of instance-level semantic segmentation to fuse semantic labels into an object-aware map.
Real-Time Geometry, Albedo, and Motion Reconstruction Using a Single RGB-D Camera
This article proposes a real-time method that uses single-view RGB-D input (a depth sensor integrated with a color camera) to simultaneously reconstruct a casually captured scene with detailed geometry…
Fusion++: Volumetric Object-Level SLAM
TLDR
An online object-level SLAM system that builds a persistent and accurate 3D graph map of arbitrary reconstructed objects is proposed; performance evaluation shows the approach is highly memory-efficient and runs online at 4–8 Hz despite not being optimized at the software level.
SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
TLDR
The object graph enables predictions for accurate ICP-based camera-to-model tracking at each live frame and efficient active search for new objects in currently undescribed image regions, as well as the generation of an object-level scene description with the potential to enable interaction.
When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM
TLDR
Experimental results demonstrate the advantages of the proposed framework with respect to traditional single view-based object recognition and pose estimation approaches, as well as its usefulness in robotic perception and augmented reality applications.
BundleFusion: real-time globally consistent 3D reconstruction using on-the-fly surface re-integration
TLDR
This work systematically addresses issues with a novel, real-time, end-to-end reconstruction framework, which outperforms state-of-the-art online systems with quality on par to offline methods, but with unprecedented speed and scan completeness.