Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

@article{Cui2022CanFM,
  title={Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?},
  author={Yuchen Cui and Scott Niekum and Abhi Gupta and Vikash Kumar and Aravind Rajeswaran},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.11134}
}
Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engagement of non-expert end-users and ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene. The former is hard to interpret for non-experts and necessitates detailed state estimation and scene understanding. The latter requires the… 

R3M: A Universal Visual Representation for Robot Manipulation

This work pre-trains a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations, resulting in R3M.
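
The three training signals named above can be illustrated with a small, self-contained sketch. This is not the released R3M code; the toy image and language encoders, the pairing of temporally close versus distant frames, and the loss weights are assumptions made only for illustration.

# Minimal sketch of an R3M-style objective (hypothetical toy encoders, not the official code).
import torch
import torch.nn.functional as F

img_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
txt_enc = torch.nn.Embedding(1000, 128)  # stand-in for a language encoder

def r3m_style_loss(frame_t0, frame_t1, frame_far, instr_tokens, l1_weight=1e-3):
    """Time-contrastive + video-language alignment + L1 sparsity, per the R3M recipe."""
    z0, z1, zf = img_enc(frame_t0), img_enc(frame_t1), img_enc(frame_far)
    # Time-contrastive term: temporally close frames should be closer than distant ones.
    pos = -(z0 - z1).pow(2).sum(-1)
    neg = -(z0 - zf).pow(2).sum(-1)
    tcn = -F.log_softmax(torch.stack([pos, neg], dim=-1), dim=-1)[..., 0].mean()
    # Video-language alignment: later frames should score higher against the instruction.
    lang = txt_enc(instr_tokens).mean(dim=1)
    align = -F.logsigmoid((z1 * lang).sum(-1) - (z0 * lang).sum(-1)).mean()
    # L1 penalty encourages sparse, compact embeddings.
    sparsity = z0.abs().mean()
    return tcn + align + l1_weight * sparsity

loss = r3m_style_loss(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64),
                      torch.rand(8, 3, 64, 64), torch.randint(0, 1000, (8, 6)))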

Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks

This is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.
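
A minimal sketch of what conditioning on both modalities can look like in practice is given below; the network sizes and the choice of simply concatenating the two task embeddings with the observation are illustrative assumptions, not the paper's architecture.

# Hypothetical sketch: one policy conditioned on both a demo embedding and a language embedding.
import torch

class DualConditionedPolicy(torch.nn.Module):
    def __init__(self, obs_dim=32, demo_dim=64, lang_dim=64, act_dim=7):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim + demo_dim + lang_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, act_dim),
        )

    def forward(self, obs, demo_emb, lang_emb):
        # Concatenate both task embeddings with the observation before predicting an action.
        return self.net(torch.cat([obs, demo_emb, lang_emb], dim=-1))

policy = DualConditionedPolicy()
action = policy(torch.rand(1, 32), torch.rand(1, 64), torch.rand(1, 64))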

A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation

This work explores a method for learning a single policy that controls agents of various morphologies to solve various tasks by distilling a large amount of proficient behavioral data, and the results show that MTGv2-history can help improve performance when the policy faces unseen tasks.

VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning

On a set of challenging hand manipulation tasks with sparse rewards and realistic visual inputs, VRL3 achieves an average of 780% better sample efficiency than the previous SOTA and solves the tasks with only 10% of the computation, demonstrating the great potential of data-driven deep reinforcement learning.

Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning

Surprisingly, it is found that the early layers of an ImageNet pre-trained ResNet model can provide rather generalizable representations for visual RL, and this paper proposes PIE-G, a simple yet effective framework that generalizes to unseen visual scenarios in a zero-shot manner.
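
A rough sketch of the frozen-early-layers idea follows, assuming a torchvision ResNet-18 and an arbitrary cut after layer2; the cut point and input resolution are assumptions, not PIE-G's exact configuration.

# Sketch of a frozen early-layer encoder feeding a visual RL agent (assumed cut point).
import torch
import torchvision

resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-trained (downloads weights)
early_layers = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1, resnet.layer2
)
for p in early_layers.parameters():
    p.requires_grad_(False)              # representation stays frozen; only the policy is trained

obs = torch.rand(1, 3, 84, 84)           # a visual RL observation
features = early_layers(obs)             # spatial feature map, here (1, 128, 11, 11)
policy_input = features.flatten(1)       # flatten before the policy head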

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

VIP is introduced, a self-supervised pre-trained visual representation capable of generating dense and smooth reward for unseen robotic tasks, enabling diverse reward-based policy learning methods, including visual trajectory optimization and online/offline RL.
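
One way such a representation can be turned into a dense reward is sketched below with a toy stand-in encoder; rewarding the change in embedding distance to a goal image captures the spirit of the idea but is not the exact VIP formulation.

# Hedged sketch: dense reward as progress toward a goal image in embedding space.
import torch

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))  # toy stand-in

def goal_distance(obs, goal):
    return torch.norm(encoder(obs) - encoder(goal), dim=-1)

def dense_reward(prev_obs, obs, goal):
    # Positive reward when the new frame moves closer to the goal in embedding space.
    return goal_distance(prev_obs, goal) - goal_distance(obs, goal)

prev_frame, frame, goal_img = (torch.rand(1, 3, 64, 64) for _ in range(3))
r = dense_reward(prev_frame, frame, goal_img)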

Inner Monologue: Embodied Reasoning through Planning with Language Models

This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.
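
The closed-loop structure can be sketched as a prompt that grows with environment feedback; the stubs below are hypothetical placeholders for a real language model and a real skill executor, not the paper's system.

# Minimal sketch of an inner-monologue-style loop: the planner's prompt accumulates feedback.
def call_llm(prompt: str) -> str:
    return "pick up the sponge"          # stand-in for a real language-model call

def execute_skill(action: str) -> str:
    return f"success: {action}"          # stand-in for robot execution + success detection

def inner_monologue(instruction: str, max_steps: int = 3) -> str:
    prompt = f"Task: {instruction}\n"
    for _ in range(max_steps):
        action = call_llm(prompt)                 # propose the next step
        feedback = execute_skill(action)          # scene description / success detector
        prompt += f"Robot: {action}\nFeedback: {feedback}\n"   # feed the feedback back in
    return prompt

print(inner_monologue("wipe the table"))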

Retrospectives on the Embodied AI Workshop

A retrospective on the state of Embodied AI research is presented, and 13 challenges presented at the Embodied AI Workshop at CVPR are grouped into three themes: visual navigation, rearrangement, and integration.

Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?

The proposed approach is able to generate high-quality cardiac diagnosis reports and achieves competitive zero-shot classification performance even compared with supervised baselines, which proves the feasibility of transferring knowledge from LLMs to the cardiac domain.

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

A new compositionality evaluation benchmark, CREPE, is introduced, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity.

References

Showing 10 of 42 references.

Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours

  • Lerrel Pinto, A. Gupta
  • Computer Science
    2016 IEEE International Conference on Robotics and Automation (ICRA)
  • 2016
This paper takes the leap of increasing the available training data to 40 times more than prior work, leading to a dataset size of 50K data points collected over 700 hours of robot grasping attempts, which allows us to train a Convolutional Neural Network for the task of predicting grasp locations without severe overfitting.
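
A toy sketch of the kind of grasp-prediction network such data enables is shown below; the architecture, patch size, and 18 angle bins are illustrative assumptions rather than the paper's exact model.

# Toy grasp-prediction CNN: given an image patch, score grasp success over discretized angles.
import torch

n_angle_bins = 18
grasp_net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=5, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, n_angle_bins),   # one success logit per candidate grasp angle
)

patch = torch.rand(1, 3, 64, 64)          # image patch centered on a candidate grasp point
angle_scores = torch.sigmoid(grasp_net(patch))
best_angle = angle_scores.argmax(dim=-1)  # pick the most promising grasp angle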

Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation

A framework for subgoal generation and planning, hierarchical visual foresight (HVF), which generates subgoal images conditioned on a goal image, and uses them for planning, and observes that the method naturally identifies semantically meaningful states as subgoals.

End-to-End Robotic Reinforcement Learning without Reward Engineering

This paper proposes an approach that removes the need for manual engineering of reward specifications: the robot learns from a modest number of examples of successful outcomes, followed by actively solicited queries in which it shows the user a state and asks for a label indicating whether that state represents successful completion of the task.
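
The core mechanism, a learned success classifier used as the reward, can be sketched as follows; the toy classifier and the uncertainty-based query rule in the comment are illustrative assumptions, not the paper's implementation.

# Sketch of a classifier-based reward: the success probability of the current image is the reward.
import torch

success_classifier = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1)   # toy stand-in classifier
)

def learned_reward(obs_image):
    # Probability that the current image shows a successful outcome.
    return torch.sigmoid(success_classifier(obs_image)).squeeze(-1)

r = learned_reward(torch.rand(1, 3, 64, 64))
# During training, states the classifier is most uncertain about (probability near 0.5)
# would be the ones actively shown to the user for a label.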

Simple but Effective: CLIP Embeddings for Embodied AI

One of the baselines is extended to produce an agent capable of zero-shot object navigation, navigating to objects that were not used as targets during training; it beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge.

Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration

  • Edward Johns
  • Computer Science
    2021 IEEE International Conference on Robotics and Automation (ICRA)
  • 2021
We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration, without requiring any prior knowledge of the object being interacted with.

Visual Reinforcement Learning with Imagined Goals

An algorithm is proposed that acquires general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies, efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
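
A compressed sketch of the "imagined goals" recipe follows, with a toy encoder standing in for the paper's latent-variable model; sampling goal latents from a unit Gaussian prior and using negative latent distance as reward are simplifications of the actual method.

# Sketch: sample imagined goals in latent space and reward proximity to them.
import torch

latent_dim = 16
encode = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 48 * 48, latent_dim))  # toy encoder

def imagined_goal(batch_size=1):
    return torch.randn(batch_size, latent_dim)    # sample a goal from the latent prior

def latent_reward(obs_image, goal_latent):
    # Reward is the negative distance between the current latent and the goal latent.
    return -torch.norm(encode(obs_image) - goal_latent, dim=-1)

goal = imagined_goal()
r = latent_reward(torch.rand(1, 3, 48, 48), goal)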

RRL: Resnet as representation for Reinforcement Learning

This work proposes RRL: Resnet as representation for Reinforcement Learning, a straightforward yet effective approach that can learn complex behaviors directly from proprioceptive inputs and delivers results comparable to learning directly from the state.
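
A hedged sketch of fusing frozen ResNet features with proprioceptive inputs in front of a small policy network is given below; the feature, proprioception, and action dimensions are assumptions, not the paper's exact configuration.

# Sketch: frozen pre-trained ResNet features concatenated with proprioception before the policy.
import torch
import torchvision

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()        # expose the 512-d feature vector
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)              # the representation is never fine-tuned

policy = torch.nn.Sequential(torch.nn.Linear(512 + 24, 256), torch.nn.ReLU(),
                             torch.nn.Linear(256, 7))   # 24-d proprioception, 7-d action (assumed)

image = torch.rand(1, 3, 224, 224)
proprio = torch.rand(1, 24)
with torch.no_grad():
    visual_features = backbone(image)
action = policy(torch.cat([visual_features, proprio], dim=-1))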

Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations

This work presents Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits the intersecting structure of demonstration trajectories to train goal-directed policies that generalize to unseen combinations of start and goal states.

CLIPort: What and Where Pathways for Robotic Manipulation

CLIPort is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, history, symbolic states, or syntactic structures.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
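
The pre-training task described here can be sketched as a symmetric contrastive loss over a batch of (image, text) pairs; the toy encoders and temperature value below are illustrative stand-ins for the real CLIP towers.

# Minimal sketch of a CLIP-style contrastive objective: match captions to images within a batch.
import torch
import torch.nn.functional as F

image_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))  # toy image tower
text_enc = torch.nn.EmbeddingBag(5000, 64)                                             # toy text tower

def clip_style_loss(images, token_ids, temperature=0.07):
    img = F.normalize(image_enc(images), dim=-1)
    txt = F.normalize(text_enc(token_ids), dim=-1)
    logits = img @ txt.t() / temperature          # pairwise image-text similarities
    labels = torch.arange(len(images))            # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

loss = clip_style_loss(torch.rand(8, 3, 32, 32), torch.randint(0, 5000, (8, 10)))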