Interactive Language: Talking to Robots in Real Time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert K. Baruch, Travis Armstrong, Peter R. Florence
We present a framework for building interactive, real-time, natural language-instructable robots in the real world, and we open source related assets (dataset, environment, benchmark, and policies). Trained with behavioral cloning on a dataset of hundreds of thousands of language-annotated trajectories, the resulting policy can proficiently execute an order of magnitude more commands than previous works: specifically, we estimate a 93.5% success rate on a set of 87,000 unique natural language…

Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

This work introduces DIAL, which uses semi-supervised language labels, leveraging the semantic understanding of CLIP, to propagate knowledge onto large datasets of unlabelled demonstration data, and then trains language-conditioned policies on the augmented datasets. This makes useful language descriptions cheaper to acquire than expensive human labels.

Skill Acquisition by Instruction Augmentation on Offline Datasets

DIAL is applied to a challenging real-world robotic manipulation domain, enabling imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.

PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav

This work presents a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning, and investigates whether human demonstrations can be replaced with ‘free’ sources of demonstrations, e.g.

StructDiffusion: Object-Centric Diffusion for Semantic Rearrangement of Novel Objects

This work proposes StructDiffusion, which combines a diffusion model and an object-centric transformer to construct structures out of a single RGB-D image based on high-level language goals, such as “set the table”, and shows how diffusion models can be used for complex multi-step 3D planning tasks.

Calibrated Interpretation: Confidence Estimation in Semantic Parsing

This work examines the calibration characteristics of six models across three model families on two common English semantic parsing datasets, finding that many models are reasonably well-calibrated and that there is a trade-off between calibration and performance.

Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

This paper demonstrates that the resulting representations are insufficient for general-purpose robotics tasks, as they fail to capture the complexity of scenes with many components, and reports outperforming state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images.

Towards Real-Time Natural Language Corrections for Assistive Robots

This paper proposes a generalizable natural language interface that allows users to provide corrective instructions to an assistive robotic manipulator in real-time and develops a language model using data collected from Amazon Mechanical Turk in hopes of capturing a comprehensive selection of terminology that real people use to describe desired corrections.

CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

This work presents CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark for learning long-horizon language-conditioned tasks, and suggests that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.

Correcting Robot Plans with Natural Language Feedback

This paper describes how to map from natural language sentences to transformations of cost functions and shows that these transformations enable users to correct goals, update robot motions to accommodate additional user preferences, and recover from planning errors.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

It is shown how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment.

Language Conditioned Imitation Learning Over Unstructured Data

This work presents a method for incorporating free-form natural language conditioning into imitation learning, and proposes combining text conditioned policies with large pretrained neural language models to scale up the number of instructions an agent can follow.

Learning to Parse Natural Language Commands to a Robot Control System

This work discusses the problem of parsing natural language commands to actions and control structures that can be readily implemented in a robot execution system, and learns a parser based on example pairs of English commands and corresponding control language expressions.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.

Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation

This work presents a new model for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments. The model dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command's hierarchical and compositional semantic structure.

Inner Monologue: Embodied Reasoning through Planning with Language Models

This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.

Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation

This work studies the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. The proposed method outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%, and can perform visuomotor tasks from natural language, such as “open the right drawer” and “move the stapler”, on a Franka Emika Panda robot.