Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
@article{Ahn2022DoAI,
  title   = {Do As I Can, Not As I Say: Grounding Language in Robotic Affordances},
  author  = {Michael Ahn and Anthony Brohan and Noah Brown and Yevgen Chebotar and Omar Cortes and Byron David and Chelsea Finn and Keerthana Gopalakrishnan and Karol Hausman and Alexander Herzog and Daniel Ho and Jasmine Hsu and Julian Ibarz and Brian Ichter and Alex Irpan and Eric Jang and Rosario Jauregui Ruano and Kyle Jeffrey and Sally Jesmonth and Nikhil Jayant Joshi and Ryan C. Julian and Dmitry Kalashnikov and Yuheng Kuang and Kuang-Huei Lee and Sergey Levine and Yao Lu and Linda Luu and Carolina Parada and Peter Pastor and Jornell Quiambao and Kanishka Rao and Jarek Rettinghouse and Diego M Reyes and Pierre Sermanet and Nicolas Sievers and Clayton Tan and Alexander Toshev and Vincent Vanhoucke and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Mengyuan Yan},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2204.01691}
}
…possible in the world. We evaluate the proposed approach on about 100 real-world robotic tasks that involve a mobile robot accomplishing a large set of language instructions in a real kitchen in a zero-shot fashion. Our experiments validate that SayCan can execute temporally extended, complex, and abstract instructions. Grounding the LLM in the real world via affordances nearly doubles the performance over the non-grounded baselines.
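The grounding described in the abstract can be read as a product of two scores per candidate skill: the LLM's estimate that the skill is useful for the instruction ("say") and an affordance estimate that the skill can succeed in the current state ("can"). A minimal sketch of that selection rule, with hypothetical inputs (the real system derives both terms from an LLM and learned value functions, not from hand-written dictionaries):

```python
# Illustrative sketch of SayCan-style skill scoring, not the authors'
# implementation. The chosen skill maximizes
# (LLM usefulness score) * (affordance value), i.e. "say" * "can".

def select_skill(llm_scores, affordances):
    """llm_scores: skill -> P(skill is useful | instruction), from an LLM.
    affordances: skill -> P(skill succeeds | current state), e.g. from a
    learned value function. Both are assumed inputs in this sketch."""
    combined = {s: llm_scores[s] * affordances[s] for s in llm_scores}
    return max(combined, key=combined.get)

# Toy example: the LLM prefers picking up the sponge, but the sponge is
# out of reach, so its affordance is low and navigation wins instead.
llm_scores = {"pick up the sponge": 0.6, "go to the counter": 0.3}
affordances = {"pick up the sponge": 0.05, "go to the counter": 0.9}
print(select_skill(llm_scores, affordances))  # -> go to the counter
```

Multiplying the two terms is what keeps the plan feasible: a skill the LLM loves but the robot cannot execute scores near zero.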
22 Citations
Inner Monologue: Embodied Reasoning through Planning with Language Models
- Computer Science, ArXiv
- 2022
This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
- Computer Science, ArXiv
- 2022
Each model is pre-trained on its own dataset, and it is shown that the complete system can execute a variety of user-specified instructions in real-world outdoor environments — choosing the correct sequence of landmarks through a combination of language and spatial context — and handle mistakes.
A Generalist Agent
- Art, ArXiv
- 2022
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato,…
Imitation Learning for Visual Robotic Manipulation
- Computer Science
- 2022
This project learns a language-conditioned policy for visual robotic manipulation through behavioural cloning, conditioned on a text description specifying the target objects to manipulate, and solves the manipulation task of “put an object into another object” with a success rate above 70%.
Neuro-Symbolic Causal Language Planning with Commonsense Prompting
- Computer Science, ArXiv
- 2022
A Neuro-Symbolic Causal Language Planner (CLAP) is proposed that elicits procedural knowledge from the LLMs with commonsense-infused prompting to solve the language planning problem in a zero-shot manner.
Learning Neuro-Symbolic Skills for Bilevel Planning
- Computer Science, ArXiv
- 2022
The approach — bilevel planning with neuro-symbolic skills — can solve a wide range of tasks with varying initial states, goals, and objects, outperforming six baselines and ablations.
Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search
- Computer Science, ArXiv
- 2022
It is shown that AdaSubS surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik’s Cube, and the inequality-proving benchmark INT, setting a new state of the art on INT.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- Computer Science, ArXiv
- 2022
This work shows that model diversity is symbiotic and can be leveraged to build AI systems with structured Socratic dialogue, in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional training.
Reasoning about Procedures with Natural Language Processing: A Tutorial
- Computer Science, ArXiv
- 2022
This tutorial provides a comprehensive and in-depth view of the research on procedures, primarily in Natural Language Processing, discussing established approaches to collecting procedures, whether by human annotation or by extraction from web resources.
Vygotskian Autotelic Artificial Intelligence: Language and Culture Internalization for Human-Like AI
- Psychology, ArXiv
- 2022
Building autonomous artificial agents able to grow open-ended repertoires of skills across their lives is one of the fundamental goals of AI. To that end, a promising developmental approach recommends…
References
Showing 1–10 of 96 references
Grounding Language in Play
- Computer Science, ArXiv
- 2020
A simple and scalable way to condition policies on human language instead of language pairing is presented, and a simple technique is introduced that transfers knowledge from large unlabeled text corpora to robotic learning, significantly improving downstream robotic manipulation.
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
- Computer Science, ArXiv
- 2021
Embodied BERT (EmBERT) is presented, a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion, bridging the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED.
Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions
- Computer Science, Int. J. Robotics Res.
- 2016
This paper presents a model that takes into account the variations in natural language and ambiguities in grounding them to robotic instructions with appropriate environment context and task constraints, based on an energy function that encodes such properties in a form isomorphic to a conditional random field.
CLIPort: What and Where Pathways for Robotic Manipulation
- Computer Science, CoRL
- 2021
CLIPort is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter, and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instances, history, symbolic states, or syntactic structures.
R3M: A Universal Visual Representation for Robot Manipulation
- Computer Science, ArXiv
- 2022
This work pre-trains a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations, resulting in R3M.
Grounding Language to Autonomously-Acquired Skills via Goal Generation
- Computer Science, ICLR
- 2021
This work proposes a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB), which decouples skill learning and language grounding via an intermediate semantic representation of the world.
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
- Computer Science, ICML
- 2022
This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
- Computer Science, CoRL
- 2021
A persistent spatial semantic representation method is proposed that enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks, despite completely avoiding the commonly used step-by-step instructions.
Combined task and motion planning through an extensible planner-independent interface layer
- Computer Science, 2014 IEEE International Conference on Robotics and Automation (ICRA)
- 2014
This work proposes a new approach that uses off-the-shelf task planners and motion planners and makes no assumptions about their implementation, using a novel representational abstraction that requires only that failures in computing a motion plan for a high-level action be identifiable and expressible in the form of logical predicates at the task level.
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
- Computer Science, NeurIPS
- 2020
This work introduces a method for incorporating unstructured natural language into imitation learning and demonstrates in a set of simulation experiments how this approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compares the results to a variety of alternative methods.