Corpus ID: 235795722

A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution

Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi
Natural language provides an accessible and expressive interface for specifying long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that persistent representations are key to bridging this gap between language and robot actions over long execution horizons. We propose a persistent spatial semantic representation method, and show how it…


Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

It is shown how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment.

A Planning based Neural-Symbolic Approach for Embodied Instruction Following

This work proposes a principled neural-symbolic approach combining symbolic planning with deep-learning methods for visual perception and natural language processing, which can act in unstructured environments when the set of skills and possible relationships is known.

Skill Induction and Planning with Latent Language

A framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making, achieves performance comparable to state-of-the-art models on ALFRED success rate and outperforms several recent methods with access to ground-truth plans.

Neuro-Symbolic Causal Language Planning with Commonsense Prompting

A Neuro-Symbolic Causal Language Planner (CLAP) is proposed that elicits procedural knowledge from the LLMs with commonsense-infused prompting to solve the language planning problem in a zero-shot manner.

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

Benchmarking evaluates the impact of different components of, or options for, the vision-and-language learning model, showing the effectiveness of pretraining strategies and assessing the robustness of the framework to novel scenarios.

Few-shot Subgoal Planning with Language Models

This work shows that language priors encoded in pre-trained models allow us to infer fine-grained subgoal sequences from few training sequences without any fine-tuning, and proposes a simple strategy to re-rank language model predictions based on interaction and feedback from the environment.

A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment

This paper proposes a modular approach to the combined navigation and object interaction problem that does not require strictly aligned vision and language training data, along with a novel geometry-aware mapping technique for cluttered indoor environments and a language understanding model generalized for household instruction following.

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents, is proposed, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks.

Embodied Multi-Agent Task Planning from Ambiguous Instruction

An embodied multi-agent task planning framework is proposed to utilize external knowledge sources and dynamically perceived visual information to resolve the high-level instructions, and dynamically allocate the decomposed tasks to multiple agents and generate sub-goals to achieve the navigation motion.

One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones

A model-agnostic milestone-based task tracker (M-TRACK) is proposed that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, along with a milestone checker that systematically checks the agent's progress in its current milestone and determines when to proceed to the next.

Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments

A novel framework that learns to adapt perception according to the task in order to maintain compact distributions over semantic maps is proposed and experiments with a mobile manipulator demonstrate more efficient instruction following in a priori unknown environments.

Environment-Driven Lexicon Induction for High-Level Instructions

A new hybrid approach is proposed that leverages the environment to induce new lexical entries at test time, even for new verbs, jointly reasoning about the text, logical forms, and environment over multi-stage instruction sequences.

Prospection: Interpretable plans from language by predicting the future

A framework is presented for learning representations that convert a natural-language command into a sequence of intermediate goals for execution on a robot. A key feature of this framework is prospection: training an agent not just to correctly execute the prescribed command, but to predict a horizon of consequences of an action before taking it.

Episodic Transformer for Vision-and-Language Navigation

This paper proposes Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions, and sets a new state of the art on the challenging ALFRED benchmark.

Grounding Robot Plans from Natural Language Instructions with Incomplete World Knowledge

A probabilistic model is introduced that utilizes background knowledge to infer latent or missing plan constituents based on semantic co-associations learned from noisy textual corpora of task descriptions to enable robots to interpret and execute high-level tasks conveyed using natural language instructions.

Learning to Parse Natural Language Commands to a Robot Control System

This work discusses the problem of parsing natural language commands to actions and control structures that can be readily implemented in a robot execution system, and learns a parser based on example pairs of English commands and corresponding control language expressions.

Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following

This work introduces a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions, and presents a learned map representation that encodes object locations and their instructed use.

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

A new model is presented for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments; it dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command's hierarchical and compositional semantic structure.

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

A model is designed that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them.

Sim-to-Real Transfer for Vision-and-Language Navigation

To bridge the gap between the high-level discrete action space learned by the VLN agent and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.