SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation

Yiqing Liang, Boyuan Chen, Shuran Song
2021 IEEE International Conference on Robotics and Automation (ICRA)
This paper focuses on visual semantic navigation, the task of producing actions for an active agent to navigate to a specified target object category in an unknown environment. To complete this task, the algorithm must simultaneously locate and navigate to an instance of the category. Compared with traditional point-goal navigation, this task requires the agent to have a stronger contextual prior about indoor environments. We introduce SSCNav, an algorithm that explicitly models scene…


Navigating to Objects in Unseen Environments by Distance Prediction
This work proposes an object goal navigation framework that performs path planning directly on an estimated distance map: it takes a bird's-eye-view semantic map as input and estimates the distance from the map cells to the target object based on learned prior knowledge.
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
A network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object is proposed, and achieves the state-of-the-art for ObjectNav while incurring up to 1,600× less computational cost for training.
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
We present a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment
Imagination-augmented Navigation Based on 2D Laser Sensor Observations
An imagination-enhanced navigation method based on 2D semantic laser scan data is proposed; its imagination module predicts the entire occupied area of an object, at the cost of a longer path and slower velocity.
CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration
This paper translates the success of zero-shot vision models to the popular embodied AI task of object navigation, and finds that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.
Cross-modal Map Learning for Vision and Language Navigation
This work proposes a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints.
Uncertainty-driven Planner for Exploration and Navigation
A novel planning framework is presented that first learns to generate occupancy maps beyond the field-of-view of the agent, and second leverages the model uncertainty over the generated areas to formulate path selection policies for each task of interest.
Multi-Agent Embodied Visual Semantic Navigation With Scene Prior Knowledge
A hierarchical decision framework based on semantic mapping, scene prior knowledge, and a communication mechanism is developed to solve visual semantic navigation, a challenging task that requires agents to learn reasonable collaboration strategies to explore efficiently under restricted communication bandwidth.
Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly
A category-agnostic scene-completion algorithm that translates the real-world workspace into a manipulable virtual scene representation and an action-snapping algorithm that refines the user input before generating the robot’s action plan are utilized.
Auxiliary Tasks and Exploration Enable ObjectGoal Navigation
This work proposes that agents will act to simplify their visual inputs so as to smooth their RNN dynamics, and that auxiliary tasks reduce overfitting by minimizing effective RNN dimensionality; i.e. a performant ObjectNav agent that must maintain coherent plans over long horizons does so by learning smooth, low-dimensional recurrent dynamics.


Object Goal Navigation using Goal-Oriented Semantic Exploration
A modular system called "Goal-Oriented Semantic Exploration" builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category, and outperforms a wide range of baselines including end-to-end learning-based methods as well as modular map-based methods.
Visual Semantic Navigation using Scene Priors
This work proposes to use Graph Convolutional Networks to incorporate prior knowledge into a deep reinforcement learning framework and shows that semantic knowledge significantly improves performance and generalization to unseen scenes and/or objects.
Learning hierarchical relationships for object-goal navigation
This work presents Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR), a target-driven visual navigation algorithm that considers the inherent relationship between "target" objects and the more salient "parent" objects occurring in their surroundings, and learns to converge much faster than other algorithms.
Spatial Action Maps for Mobile Manipulation
This work presents "spatial action maps," in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a local navigational endpoint at the corresponding scene location.
ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects
This document summarizes the consensus recommendations of this working group on ObjectNav and makes recommendations on subtle but important details of evaluation criteria, the agent's embodiment parameters, and the characteristics of the environments within which the task is carried out.
Semantic Scene Completion from a Single Depth Image
The semantic scene completion network (SSCNet) is introduced, an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum.
Emergence of exploratory look-around behaviors through active observation completion
This work proposes a reinforcement learning solution, where the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment, and introduces sidekick policy learning, which exploits the asymmetry in observability between training and test time.
Semantic Visual Navigation by Watching YouTube Videos
This paper learns and leverages semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos, and improves upon end-to-end RL methods by 66%, while using 250x fewer interactions.
Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks
This work proposes a reinforcement learning solution, where the agent is rewarded for actions that reduce its uncertainty about the unobserved portions of its environment and develops a recurrent neural network-based approach to perform active completion of panoramic natural scenes and 3D object shapes.
Human-Centric Indoor Scene Synthesis Using Stochastic Grammar
We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, to obtain large-scale 2D/3D image data with the perfect per-pixel ground truth. An attributed spatial