Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

@article{Ahn2022DoAI,
  title={Do As I Can, Not As I Say: Grounding Language in Robotic Affordances},
  author={Michael Ahn and Anthony Brohan and Noah Brown and Yevgen Chebotar and Omar Cortes and Byron David and Chelsea Finn and Keerthana Gopalakrishnan and Karol Hausman and Alexander Herzog and Daniel Ho and Jasmine Hsu and Julian Ibarz and Brian Ichter and Alex Irpan and Eric Jang and Rosario Jauregui Ruano and Kyle Jeffrey and Sally Jesmonth and Nikhil Jayant Joshi and Ryan C. Julian and Dmitry Kalashnikov and Yuheng Kuang and Kuang-Huei Lee and Sergey Levine and Yao Lu and Linda Luu and Carolina Parada and Peter Pastor and Jornell Quiambao and Kanishka Rao and Jarek Rettinghouse and Diego M Reyes and Pierre Sermanet and Nicolas Sievers and Clayton Tan and Alexander Toshev and Vincent Vanhoucke and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Mengyuan Yan},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.01691}
}
Abstract (excerpt): […] possible in the world. We evaluate the proposed approach on about 100 real-world robotic tasks that involve a mobile robot accomplishing a large set of language instructions in a real kitchen in a zero-shot fashion. Our experiments validate that SayCan can execute temporally-extended, complex, and abstract instructions. Grounding the LLM in the real world via affordances nearly doubles the performance over the non-grounded baselines.
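For intuition, here is a minimal, hypothetical sketch of the grounding mechanism the abstract refers to: each candidate skill is scored by multiplying the language model's likelihood of the skill's text description by an affordance estimate (a learned value function predicting whether the skill can succeed from the current state), and the highest-scoring skill is executed. The skill list, function names, and constant stub values below are illustrative assumptions, not the paper's actual API or models.

import math

# Hypothetical skill library of text descriptions; the real system uses a
# large set of learned manipulation and navigation skills.
SKILLS = ["pick up the sponge", "go to the counter", "put down the sponge"]

def llm_log_prob(prompt: str, skill: str) -> float:
    """Stand-in for the language model's log-likelihood of `skill` being the
    next step, given the instruction and the steps executed so far."""
    return -0.01 * len(skill)  # constant stub so the sketch runs

def affordance_value(state, skill: str) -> float:
    """Stand-in for the skill's value function: an estimate of the
    probability that the skill succeeds from the current state."""
    return 0.5  # constant stub

def select_skill(prompt: str, state) -> str:
    """Pick the skill that maximizes LLM likelihood times affordance value."""
    return max(
        SKILLS,
        key=lambda s: math.exp(llm_log_prob(prompt, s)) * affordance_value(state, s),
    )

print(select_skill("Human: clean up the spilled drink.\nRobot: 1.", state=None))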
Inner Monologue: Embodied Reasoning through Planning with Language Models
TLDR
This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.
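A minimal sketch of the closed-loop language feedback idea described above, under the assumption that success detection and scene description are available as text: after each action, their output is appended to the planner's prompt so the next step can react to it. All functions below are hypothetical stand-ins, not the paper's components.

def llm_next_step(prompt: str) -> str:
    """Stand-in for a language-model call that proposes the next action."""
    return "pick up the coke can"  # fixed output for illustration

def execute(action: str) -> bool:
    """Stand-in for low-level skill execution; returns success or failure."""
    return False  # pretend the grasp slipped

def describe_scene() -> str:
    """Stand-in for a scene-description module, e.g. an object detector."""
    return "Scene: a coke can is on the table."

prompt = "Human: bring me the coke can.\n"
for _ in range(3):  # a short closed-loop episode
    action = llm_next_step(prompt)
    ok = execute(action)
    # Key idea: textual environment feedback becomes part of the prompt,
    # forming an "inner monologue" the planner can use to replan.
    prompt += f"Robot: {action}\nSuccess: {'yes' if ok else 'no'}\n"
    prompt += describe_scene() + "\n"
print(prompt)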
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
TLDR
Each model is pre-trained on its own dataset, and it is shown that the complete system can execute a variety of user-specified instructions in real-world outdoor environments — choosing the correct sequence of landmarks through a combination of language and spatial context — and handle mistakes.
A Generalist Agent
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy.
Imitation Learning for Visual Robotic Manipulation
TLDR
This project learns a language-conditioned policy for visual robotic manipulation through behavioural cloning, conditioned on a text description specifying the target objects to manipulate, and shows that the policy can solve the manipulation task of “put an object into another object” with a success rate above 70%.
Neuro-Symbolic Causal Language Planning with Commonsense Prompting
TLDR
A Neuro-Symbolic Causal Language Planner (CLAP) is proposed that elicits procedural knowledge from the LLMs with commonsense-infused prompting to solve the language planning problem in a zero-shot manner.
Learning Neuro-Symbolic Skills for Bilevel Planning
TLDR
The approach — bilevel planning with neuro-symbolic skills — can solve a wide range of tasks with varying initial states, goals, and objects, outperforming six baselines and ablations.
Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search
TLDR
It is shown that AdaSubS surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik’s Cube, and the inequality-proving benchmark INT, setting a new state of the art on INT.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
TLDR
This work shows that model diversity is symbiotic and can be leveraged to build AI systems with structured Socratic dialogue, in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning.
Reasoning about Procedures with Natural Language Processing: A Tutorial
TLDR
This tutorial provides a comprehensive and in-depth view of the research on procedures, primarily in Natural Language Processing, discussing established approaches to collecting procedures by human annotation or by extraction from web resources.
Vygotskian Autotelic Artificial Intelligence: Language and Culture Internalization for Human-Like AI
Building autonomous artificial agents able to grow open-ended repertoires of skills across their lives is one of the fundamental goals of AI. To that end, a promising developmental approach recommends the design of intrinsically motivated agents that learn new skills by generating and pursuing their own goals (autotelic agents).

References

Showing 1-10 of 96 references
Grounding Language in Play
TLDR
A simple and scalable way to condition policies on human language instead of language pairing is presented, and a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning is introduced, significantly improving downstream robotic manipulation.
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
TLDR
Embodied BERT (EmBERT) is presented, a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion and bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED.
Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions
TLDR
This paper presents a model that takes into account the variations in natural language and ambiguities in grounding them to robotic instructions with appropriate environment context and task constraints, based on an energy function that encodes such properties in a form isomorphic to a conditional random field.
CLIPort: What and Where Pathways for Robotic Manipulation
TLDR
CLIPORT is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, history, symbolic states, or syntactic structures.
R3M: A Universal Visual Representation for Robot Manipulation
TLDR
This work pre-trains a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations, resulting in R3M.
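As a rough illustration of how the three ingredients named above can be combined into one objective, here is a simplified sketch: a time-contrastive term pulling together temporally close frames, a stand-in video-language alignment term (a cosine alignment between a frame embedding and its caption embedding, which is not the paper's exact objective), and an L1 sparsity penalty. Shapes, weights, and the alignment term are assumptions for illustration only.

import torch
import torch.nn.functional as F

def r3m_style_loss(z_t, z_tk, z_neg, z_lang, l1_weight=1e-5):
    """z_t, z_tk: embeddings of temporally close frames from the same video;
    z_neg: embeddings of frames from other videos; z_lang: embeddings of the
    paired language annotations. All tensors have shape [batch, dim]."""
    # Time-contrastive term: frames close in time should be more similar
    # than frames from other videos (InfoNCE-style, single negative).
    pos = F.cosine_similarity(z_t, z_tk)
    neg = F.cosine_similarity(z_t, z_neg)
    tcn = -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(neg))).mean()
    # Simplified video-language alignment term.
    align = -F.cosine_similarity(z_tk, z_lang).mean()
    # L1 penalty encouraging sparse, compact representations.
    sparsity = l1_weight * z_t.abs().mean()
    return tcn + align + sparsity

# Toy usage with random embeddings.
z = lambda: torch.randn(4, 32)
print(r3m_style_loss(z(), z(), z(), z()))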
Grounding Language to Autonomously-Acquired Skills via Goal Generation
TLDR
This work proposes a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB), which decouples skill learning and language grounding via an intermediate semantic representation of the world.
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
TLDR
This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
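The "semantic translation" step mentioned above can be illustrated with a small sketch: each free-form step produced by the language model is mapped to the closest action in a fixed admissible-action set by embedding similarity. The paper uses a sentence-embedding model; the toy bag-of-words embedding and action list below are assumptions that keep the example self-contained.

from collections import Counter
import math

ADMISSIBLE_ACTIONS = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (the paper uses learned sentence embeddings)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def to_admissible(free_form_step: str) -> str:
    """Map a generated step to the most similar admissible action."""
    q = embed(free_form_step)
    return max(ADMISSIBLE_ACTIONS, key=lambda a: cosine(q, embed(a)))

print(to_admissible("go grab the milk from the fridge"))  # prints "grab milk"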
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
TLDR
A persistent spatial semantic representation method is proposed that enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks, despite completely avoiding the commonly used step-by-step instructions.
Combined task and motion planning through an extensible planner-independent interface layer
TLDR
This work proposes a new approach that uses off-the-shelf task planners and motion planners and makes no assumptions about their implementation. It uses a novel representational abstraction that requires only that failures in computing a motion plan for a high-level action be identifiable and expressible in the form of logical predicates at the task level.
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
TLDR
This work introduces a method for incorporating unstructured natural language into imitation learning and demonstrates in a set of simulation experiments how this approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compares the results to a variety of alternative methods.