Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

  • Yuchen Cui, Scott Niekum, Abhi Gupta, Vikash Kumar, Aravind Rajeswaran

Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engaging non-expert end users and for the eventual adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene. The former is hard for non-experts to interpret and necessitates detailed state estimation and scene understanding. The latter requires the…

R3M: A Universal Visual Representation for Robot Manipulation

This work pre-trains a visual representation on the Ego4D dataset of human videos, using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations; the resulting model is R3M.
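
Two of the named ingredients, a time-contrastive loss and an L1 sparsity penalty, can be sketched in a few lines. This is a toy NumPy illustration of the general recipe (hinge-style contrastive loss over frame embeddings), not R3M's actual implementation or loss weighting:

```python
import numpy as np

def time_contrastive_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Hinge loss: embeddings of temporally close frames (anchor, pos)
    should be closer than embeddings of distant frames (anchor, neg)."""
    d_pos = np.linalg.norm(z_anchor - z_pos, axis=1)
    d_neg = np.linalg.norm(z_anchor - z_neg, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

def sparsity_penalty(z, weight=1e-3):
    """L1 penalty encouraging sparse, compact representations."""
    return weight * np.abs(z).mean()

# Toy usage with random stand-ins for frame embeddings.
rng = np.random.default_rng(0)
z_a, z_p, z_n = (rng.normal(size=(8, 32)) for _ in range(3))
loss = time_contrastive_loss(z_a, z_p, z_n) + sparsity_penalty(z_a)
```

In the full objective these terms would be combined with the video-language alignment loss and backpropagated through the visual encoder producing the embeddings.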

Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks

This is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

This work introduces Value-Implicit Pre-Training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks, and shows that it provides dense visual rewards for an extensive set of simulated and real-robot tasks.
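
The dense-reward idea can be illustrated as progress in embedding space: reward is the decrease in distance between the current observation's embedding and a goal embedding. The following is a minimal sketch of that general recipe, not VIP's exact value-implicit formulation (which trains the embedding with a goal-conditioned value objective):

```python
import numpy as np

def embedding_distance_reward(phi_prev, phi_next, phi_goal):
    """Dense reward: positive when the new observation's embedding
    moves closer to the goal embedding, negative when it moves away."""
    d_prev = np.linalg.norm(phi_prev - phi_goal)
    d_next = np.linalg.norm(phi_next - phi_goal)
    return d_prev - d_next

# Toy embeddings: moving toward the goal yields positive reward,
# moving away yields negative reward.
phi_goal = np.zeros(4)
r_toward = embedding_distance_reward(np.ones(4), 0.5 * np.ones(4), phi_goal)
r_away = embedding_distance_reward(0.5 * np.ones(4), np.ones(4), phi_goal)
```

Such a reward is dense (defined at every step) and requires only a goal image, which is what makes pre-trained embeddings attractive as reward generators for unseen tasks.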

VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning

On a set of challenging hand-manipulation tasks with sparse rewards and realistic visual inputs, VRL3 achieves on average 780% better sample efficiency than the previous SOTA and solves the tasks with only 10% of the computation, demonstrating the great potential of data-driven deep reinforcement learning.

Inner Monologue: Embodied Reasoning through Planning with Language Models

This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.

A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation

This work explores a method for learning a single policy that controls agents of various morphologies to solve various tasks by distilling a large amount of proficient behavioral data; the results show that when the policy faces unseen tasks, the MTGv2-history variant can help improve performance.

Retrospectives on the Embodied AI Workshop

This analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR, grouped into three themes: visual navigation, rearrangement, and embodied vision-and-language.



Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours

  • Lerrel Pinto, A. Gupta
  • Computer Science
    2016 IEEE International Conference on Robotics and Automation (ICRA)
  • 2016
This paper takes the leap of increasing the available training data to 40 times more than prior work, leading to a dataset size of 50K data points collected over 700 hours of robot grasping attempts, which allows us to train a Convolutional Neural Network for the task of predicting grasp locations without severe overfitting.

Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation

This work presents hierarchical visual foresight (HVF), a framework for subgoal generation and planning that generates subgoal images conditioned on a goal image and uses them for planning; the method naturally identifies semantically meaningful states as subgoals.

End-to-End Robotic Reinforcement Learning without Reward Engineering

This paper proposes an approach for removing the need for manual engineering of reward specifications by enabling a robot to learn from a modest number of examples of successful outcomes, followed by actively solicited queries, where the robot shows the user a state and asks for a label to determine whether that state represents successful completion of the task.

Simple but Effective: CLIP Embeddings for Embodied AI

One of the baselines is extended to produce an agent capable of zero-shot object navigation, navigating to objects that were not used as targets during training; it beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge.

Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration

  • Edward Johns
  • Computer Science
    2021 IEEE International Conference on Robotics and Automation (ICRA)
  • 2021
We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration, without requiring any prior knowledge of…

Visual Reinforcement Learning with Imagined Goals

An algorithm is proposed that acquires general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies, efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.

RRL: Resnet as representation for Reinforcement Learning

This work proposes RRL: Resnet as representation for Reinforcement Learning, a straightforward yet effective approach that can learn complex behaviors directly from proprioceptive inputs and delivers results comparable to learning directly from the state.

Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations

This work presents Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits the intersecting structure of demonstration trajectories to train goal-directed policies that generalize to unseen combinations of start and goal states.

CLIPort: What and Where Pathways for Robotic Manipulation

CLIPORT is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, history, symbolic states, or syntactic structures.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.