Spatial Action Maps for Mobile Manipulation

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Johnny Lee, Szymon M. Rusinkiewicz, and Thomas A. Funkhouser
Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (e.g., step forward, turn left, turn right, etc.) from images of the current state (e.g., a bird's-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to learn with dense action representations defined in the same domain as the state. In this work, we present "spatial action maps," in which the set of possible actions is represented by a pixel… 
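As a minimal sketch of the idea (not the authors' implementation), a spatial action map can be realized as a per-pixel Q-value image aligned with the state image, with the greedy action taken at the argmax pixel; the map size and the random placeholder network output below are assumptions for illustration:

```python
import numpy as np

def select_action(q_map):
    """Greedy action from a spatial action map: the action space is the
    pixel grid itself, so the chosen action is the (row, col) location
    with the highest predicted Q-value."""
    flat_idx = int(np.argmax(q_map))
    return np.unravel_index(flat_idx, q_map.shape)

# In practice q_map would come from a fully convolutional network applied
# to the bird's-eye-view state image; a random array stands in for it here.
rng = np.random.default_rng(0)
q_map = rng.random((96, 96))          # one Q-value per pixel of the state
row, col = select_action(q_map)       # target location in the state's frame
```

Because actions and states share the same spatial domain, the policy output can be interpreted directly as "move toward this location in the map."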

Spatial Action Maps Augmented with Visit Frequency Maps for Exploration Tasks

The visit frequency map (VFM) and its corresponding reward function are introduced to direct the agent to actively search previously unexplored areas; experiments show the method is more efficient than competing approaches.
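A hedged sketch of the mechanism: maintain a per-cell visit counter over the map and shape the reward so that rarely visited cells pay more, pushing the agent toward unexplored areas. The grid size and the exact reward form (an inverse-count bonus) are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

class VisitFrequencyMap:
    """Tracks how often each map cell has been visited and converts the
    count into an exploration bonus (large for unexplored cells,
    shrinking with each revisit)."""

    def __init__(self, height, width):
        self.counts = np.zeros((height, width), dtype=np.int64)

    def step(self, row, col):
        # Record the visit, then return a novelty-style reward.
        self.counts[row, col] += 1
        return 1.0 / self.counts[row, col]

vfm = VisitFrequencyMap(8, 8)
first = vfm.step(2, 3)   # unexplored cell: full bonus
second = vfm.step(2, 3)  # revisit: bonus decays
```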

PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

A network is proposed that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object: a modular approach that disentangles the skill of 'where to look?' for an object from 'how to navigate to $(x,\ y)$?'.
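One way to picture the modular decomposition: combine the two predicted potential maps and hand the argmax location to a separate navigation module as a long-term goal. The equal weighting and map shapes below are assumptions for illustration, not PONI's exact recipe:

```python
import numpy as np

def choose_long_term_goal(area_potential, object_potential):
    """Combine two complementary potential maps over the semantic map and
    return the (row, col) with the highest combined potential; a separate
    point-goal navigation skill then handles 'how to navigate to (x, y)?'."""
    combined = area_potential + object_potential
    return np.unravel_index(int(np.argmax(combined)), combined.shape)

# Toy example: the object potential peaks at cell (1, 2).
area = np.zeros((4, 4))
obj = np.zeros((4, 4))
obj[1, 2] = 1.0
goal = choose_long_term_goal(area, obj)
```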

ReLMoGen: Integrating Motion Generation in Reinforcement Learning for Mobile Manipulation

It is argued that, by lifting the action space and by leveraging sampling-based motion planners, this work can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space.

Towards Disturbance-Free Visual Mobile Manipulation

This paper studies the problem of training agents to complete the task of visual mobile manipulation in the ManipulaTHOR environment while avoiding unnecessary collision (disturbance) with objects, and proposes a two-stage training curriculum where an agent is first allowed to freely explore and build basic competencies without penalization.

Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control

This work presents a learnable action space, Hand-eye Action Networks (HAN), that learns coordinated hand-eye movements from human teleoperated demonstrations, and shows that a visuomotor policy equipped with HAN inherits the key spatial-invariance property of hand-eye coordination and generalizes to new scene configurations.

Audio-Visual Waypoints for Navigation

This work introduces a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) audio-visual waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves.

Learning to Set Waypoints for Audio-Visual Navigation

This work introduces a reinforcement learning approach to audio-visual navigation with two key novel elements: waypoints that are dynamically set and learned end-to-end within the navigation policy, and an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves.

Policy learning in SE(3) action spaces

ASRSE3 is proposed, a new method for handling high-dimensional spatial action spaces that transforms an original MDP with a high-dimensional action space into a new MDP with a reduced action space and an augmented state space; it is shown that the method outperforms standard baselines and can be used in practice on real robotic systems.

VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects

This paper proposes object-centric actionable visual priors as a novel perception-interaction handshaking point, in which the perception system outputs more actionable guidance than kinematic structure estimation by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordances and trajectory proposals.

Learning Visual Affordances with Target-Orientated Deep Q-Network to Grasp Objects by Harnessing Environmental Fixtures

It is empirically shown that TO-DQN learns to solve the task across different simulated environment settings and outperforms both a standard Deep Q-Network and a variant of it in terms of training efficiency and robustness.

Learning to Move with Affordance Maps

This paper designs an agent that learns to predict a spatial affordance map that elucidates what parts of a scene are navigable through active self-supervised experience gathering, and shows that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.

Cognitive Mapping and Planning for Visual Navigation

The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, in which mapping is driven by the needs of the task, and on a spatial memory with the ability to plan given an incomplete set of observations about the world.

From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots

This work presents the first approach that learns a target-oriented end-to-end navigation model for a robotic platform, and demonstrates that the learned navigation model is directly transferable to previously unseen virtual and, more interestingly, real-world environments.

Intention-Net: Integrating Planning and Deep Learning for Goal-Directed Autonomous Navigation

A two-level hierarchical approach that integrates model-free deep learning with model-based path planning is introduced; results suggest that the learned motion controller is robust against perceptual uncertainty and, by integrating with a path planner, generalizes effectively to new environments and goals.

Playing Doom with SLAM-Augmented Deep Reinforcement Learning

Inspired from prior work in human cognition that indicates how humans employ a variety of semantic concepts and abstractions to reason about the world, an agent-model is built that incorporates such abstractions into its policy-learning framework.

FollowNet: Robot Navigation by Following Natural Language Directions with Deep Reinforcement Learning

It is shown that the FollowNet agent learns to execute previously unseen instructions described with a similar vocabulary and successfully navigates along paths not encountered during training, showing a 30% improvement over a baseline model without the attention mechanism.

Learning Visual Affordances for Robotic Manipulation

This thesis shows that it is possible to work around this limitation of model-free reinforcement learning by sequencing primitive picking motions into more complex manipulation policies, and studies how it can be combined with residual physics to enable learning end-to-end visuomotor policies that leverage the benefits of analytical models while still maintaining the capacity (via data-driven residuals) to account for real-world dynamics that are not explicitly modeled.

Learning to See before Learning to Act: Visual Pre-training for Manipulation

It is found that pre-training on vision tasks significantly improves generalization and sample efficiency for learning to manipulate objects, and directly transferring model parameters from vision networks to affordance prediction networks can result in successful zero-shot adaptation.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Gibson Env: Real-World Perception for Embodied Agents

This paper investigates developing real-world perception for active agents, proposes Gibson Environment for this purpose, and showcases a set of perceptual tasks learned therein.