Continuous Scene Representations for Embodied AI

Samir Yitzhak Gadre, Kiana Ehsani, Shuran Song, Roozbeh Mottaghi
We propose Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous-valued embeddings. Our method captures feature relationships between objects, composes them into a graph structure on-the-fly, and situates an embodied agent within the representation. Our key insight is to embed pairwise relationships between objects in a latent space. This allows for a richer…
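The abstract's key idea, continuous object embeddings linked by pairwise relationship embeddings composed into a graph on-the-fly, can be sketched roughly as follows. This is a minimal illustration only: the class and method names (`ContinuousSceneGraph`, `observe_object`, `observe_relation`) are hypothetical and not the authors' actual API, and the running-average update is a simplification of how such a representation might accumulate evidence across a trajectory.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ContinuousSceneGraph:
    """Sketch of a CSR-style scene graph: nodes hold continuous object
    embeddings, edges hold continuous pairwise relationship embeddings.
    All names here are illustrative, not the paper's implementation."""
    dim: int = 8
    nodes: dict = field(default_factory=dict)   # object_id -> embedding
    edges: dict = field(default_factory=dict)   # (id_a, id_b) -> embedding

    def observe_object(self, obj_id, feature):
        """Add or update an object node, averaging with the prior
        embedding to accumulate evidence over repeated observations."""
        feature = np.asarray(feature, dtype=float)
        if obj_id in self.nodes:
            self.nodes[obj_id] = 0.5 * (self.nodes[obj_id] + feature)
        else:
            self.nodes[obj_id] = feature

    def observe_relation(self, id_a, id_b, rel_embedding):
        """Record a continuous pairwise relationship between two objects."""
        self.edges[(id_a, id_b)] = np.asarray(rel_embedding, dtype=float)

    def neighbors(self, obj_id):
        """Objects this node has an outgoing relationship edge to."""
        return [b for (a, b) in self.edges if a == obj_id]
```

A second observation of the same object blends into its existing embedding, so the graph is built incrementally as the agent moves, rather than from a single view.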


Panoptic Scene Graph Generation
Panoptic scene graph generation (PSG) is introduced, a new task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes.
Simple but Effective: CLIP Embeddings for Embodied AI
One of the baselines is extended into an agent capable of zero-shot object navigation, navigating to objects that were not used as targets during training; it beats the winners of both the 2021 Habitat ObjectNav Challenge, which employed auxiliary tasks, depth maps, and human demonstrations, and the 2019 Habitat PointNav Challenge.
A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search
This work proposes a simple yet effective method to search for and map which objects need to be rearranged, then rearrange each object until the task is complete; it improves on current state-of-the-art end-to-end reinforcement-learning methods that learn visual rearrangement policies.
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
ProcTHOR, a framework for procedural generation of Embodied AI environments, enables sampling arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.
CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration
This paper translates the success of zero-shot vision models to the popular embodied AI task of object navigation, and finds that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.


Scene Graph Generation from Objects, Phrases and Region Captions
This work proposes a novel neural network model, termed Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner, and shows that joint learning across the three tasks with the proposed method brings mutual improvements over previous models.
Neural Scene Graphs for Dynamic Scenes
This work proposes a learned scene graph representation, which encodes object transformations and radiance, allowing us to efficiently render novel arrangements and views of the scene, and presents the first neural rendering method that represents multi-object dynamic scenes as scene graphs.
Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions
This work proposes a learned method that regresses a scene graph from the point cloud of a scene, based on PointNet and Graph Convolutional Networks, and introduces 3DSSG, a semi-automatically generated dataset that contains semantically rich scene graphs of 3D scenes.
Visual Room Rearrangement
The experiments show that solving this challenging interactive task, which involves navigation and object interaction, is beyond the capabilities of current state-of-the-art techniques for embodied tasks; perfect performance on such tasks remains far out of reach.
3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans
This is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking and can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
Scene Graph Generation by Iterative Message Passing
This work explicitly models objects and their relationships using scene graphs, a visually-grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
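The iterative message-passing idea underlying scene graph generation can be illustrated with a stripped-down sketch: node states are repeatedly refined by aggregating information from graph neighbors. The actual model uses learned GRU updates and edge features; the plain neighbor-averaging below is purely an illustration of the iteration pattern, with all names chosen for this example.

```python
import numpy as np

def iterative_message_passing(node_feats, adjacency, steps=3):
    """Refine node states over a graph by iteratively blending each
    node's state with the average of its neighbors' states.
    A deliberate simplification of learned message passing."""
    h = np.asarray(node_feats, dtype=float)
    A = np.asarray(adjacency, dtype=float)
    # Row-normalize the adjacency so each node averages its neighbors;
    # isolated nodes keep a degree of 1 to avoid division by zero.
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    A_norm = A / deg
    for _ in range(steps):
        messages = A_norm @ h        # aggregate neighbor states
        h = 0.5 * (h + messages)     # blend own state with messages
    return h
```

After each step a node's state mixes in its neighborhood context, which is what lets the final states encode relationships rather than isolated detections.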
Kimera: From SLAM to spatial perception with 3D dynamic scene graphs
This article attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D dynamic scene graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment.
Spatial-Temporal Transformer for Dynamic Scene Graph Generation
Spatial-temporal Transformer (STTran) is a neural network that consists of two core modules: a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and a temporal decoder that takes the output of the spatial encoder as input to capture the temporal dependencies between frames and infer the dynamic relationships.
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs
This work introduces Action Genome, a representation that decomposes actions into spatio-temporal scene graphs and demonstrates the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples.
SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation
A neural message-passing approach augments an input 3D indoor scene with new objects matching their surroundings, weighting messages through an attention mechanism; it significantly outperforms state-of-the-art approaches at correctly predicting objects missing from a scene.