Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, Jonathan Tompson
In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that were either collected with specific tasks in mind or expensively relabelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP [38] or ViLD [21] have been applied to robotics for learning representations and scene…

Distilling Internet-Scale Vision-Language Models into Embodied Agents

This work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.