From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data

@article{Cui2022FromPT,
  title={From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data},
  author={Zichen Jeff Cui and Yibin Wang and Nur Muhammad (Mahi) Shafiullah and Lerrel Pinto},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.10047}
}
While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e., play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling…
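
To make the setting concrete, here is a minimal Python sketch of the simplest version of this recipe: hindsight-relabel play sequences so that frames reached later in the stream act as goals, then behavior-clone a goal-conditioned policy. The MLP policy, tensor shapes, and relabeling horizon are illustrative assumptions; the paper's actual model is a conditional Behavior Transformer over visual observations.

import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        # Condition on the current observation and a future frame from the
        # same play stream, treated as the goal.
        return self.net(torch.cat([obs, goal], dim=-1))

def hindsight_batches(obs, act, horizon, batch_size):
    # Play data has no task labels: any frame reached within `horizon` steps
    # can serve as the goal for the state-action pair that preceded it.
    T = obs.shape[0]
    while True:
        t = torch.randint(0, T - horizon, (batch_size,))
        dt = torch.randint(1, horizon + 1, (batch_size,))
        yield obs[t], act[t], obs[t + dt]

obs = torch.randn(10_000, 10)   # stand-in for logged play observations
act = torch.randn(10_000, 4)    # stand-in for logged play actions
policy = GoalConditionedPolicy(obs_dim=10, act_dim=4)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
batches = hindsight_batches(obs, act, horizon=50, batch_size=256)
for step in range(1000):
    o, a, g = next(batches)
    loss = ((policy(o, g) - a) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()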

Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

DIAL is introduced, which leverages the semantic understanding of CLIP to propagate semi-supervised language labels onto large datasets of unlabeled demonstration data and then trains language-conditioned policies on the augmented datasets, enabling cheaper acquisition of useful language descriptions than expensive human labels.
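
As a rough illustration of the relabeling step, the sketch below scores a frame against a set of candidate instructions with the open-source CLIP package and keeps the best match. The candidate list and the single-frame heuristic are assumptions for illustration, not DIAL's exact recipe (DIAL adapts its vision-language model to robot data before propagating labels).

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical instruction pool; in practice this would come from a small
# human-labeled subset of the demonstration data.
candidate_instructions = ["pick up the towel", "open the drawer", "wipe the table"]
text_tokens = clip.tokenize(candidate_instructions).to(device)

def relabel(frame: Image.Image) -> str:
    # Assign the instruction whose CLIP text embedding best matches the frame.
    image = preprocess(frame).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text_tokens)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = (image_feat @ text_feat.T).squeeze(0)
    return candidate_instructions[scores.argmax().item()]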

Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

This work proposes a simple yet effective model for robots to solve instruction-following tasks in vision-based environments that outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.

References

Showing 1-10 of 52 references

Behavior Transformers: Cloning k modes with one stone

Behavior Transformer is presented, a new technique for modeling unlabeled demonstration data with multiple modes; it improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets.
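
The mechanism is easy to sketch: cluster the continuous actions into k modes with k-means, train the model to predict a mode class plus a continuous residual offset, and decode by adding the two. Dimensions and the cluster count below are placeholders.

import numpy as np
from sklearn.cluster import KMeans

actions = np.random.randn(5000, 4)   # stand-in for demonstrated actions
k = 8
kmeans = KMeans(n_clusters=k, n_init=10).fit(actions)

bins = kmeans.predict(actions)                      # discrete mode targets
offsets = actions - kmeans.cluster_centers_[bins]   # continuous residual targets

# A BeT-style head is trained with cross-entropy on `bins` and a regression
# loss on `offsets`; at test time an action is decoded as
# cluster_center[chosen_bin] + predicted_offset[chosen_bin].
def decode(bin_logits, pred_offsets):
    b = int(bin_logits.argmax())
    return kmeans.cluster_centers_[b] + pred_offsets[b]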

Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations

This work presents Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits the intersecting structure of demonstrated trajectories to train goal-directed policies that generalize to unseen start and goal state combinations.

Learning Latent Plans from Play

Play-LMP is introduced, a method designed to handle variability in the learning-from-play (LfP) setting by organizing it in an embedding space; play-supervised models, unlike their expert-trained counterparts, are found to be more robust to perturbations and to exhibit retrying-till-success behavior.
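
A minimal stand-in for the latent-plan idea: a sequence encoder compresses a window of play into a latent plan z, and a goal-conditioned policy decodes actions from (state, goal, z). The GRU encoder and dimensions below are assumptions; Play-LMP trains this setup as a sequence-to-sequence CVAE with a learned prior over plans.

import torch
import torch.nn as nn

class PlanEncoder(nn.Module):
    def __init__(self, obs_dim, z_dim=32, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, window):              # window: (batch, T, obs_dim)
        _, h = self.rnn(window)
        h = h.squeeze(0)
        # Parameters of the posterior q(z | play window): many different
        # plans can explain the same start and end state.
        return self.mu(h), self.logvar(h)

# The policy decodes actions from (current obs, goal obs, sampled z); training
# maximizes action likelihood plus a KL term against a prior p(z | obs, goal),
# so plans can be sampled at test time without seeing the full window.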

Parrot: Data-Driven Behavioral Priors for Reinforcement Learning

This paper proposes a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials from a wide range of previously seen tasks, and shows how this learned prior can be used for rapidly learning new tasks without impeding the RL agent's ability to try out novel behaviors.

Demonstration-Bootstrapped Autonomous Practicing via Multi-Task Reinforcement Learning

This work proposes a reinforcement learning system that leverages multi-task RL bootstrapped with prior data to enable continuous autonomous practicing, minimizing the number of resets needed while learning temporally extended behaviors.

Playful Interactions for Representation Learning

This work proposes to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks, and demonstrates that policies trained on these representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations.

Towards More Generalizable One-shot Visual Imitation Learning

MOSAIC (Multi-task One-Shot Imitation with self-Attention and Contrastive learning) is proposed, which integrates a self-attention model architecture and a temporal contrastive module to enable better task disambiguation and more robust representation learning.

Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration

It is demonstrated that complex manipulation tasks, such as picking up a towel, wiping an object, and depositing the towel back in its previous position, can be learned entirely from raw images with direct behavior cloning.

Reinforcement Learning as One Big Sequence Modeling Problem

This work explores how RL can be reframed as a “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards.
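
The reframing is concrete: discretize states, actions, and rewards per dimension and interleave them into a single token stream for a standard next-token model. The uniform binning and value ranges below are placeholder assumptions in the spirit of this approach.

import numpy as np

N_BINS = 100  # per-dimension uniform discretization

def discretize(x, lo, hi):
    return np.clip(((x - lo) / (hi - lo) * N_BINS).astype(int), 0, N_BINS - 1)

def trajectory_to_tokens(states, actions, rewards):
    # Interleave so the model sees ... s_t a_t r_t s_{t+1} a_{t+1} r_{t+1} ...
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s, -1.0, 1.0))
        tokens.extend(discretize(a, -1.0, 1.0))
        tokens.append(int(discretize(np.array([r]), 0.0, 1.0)[0]))
    return np.array(tokens)

# Control then becomes decoding: condition on the current state tokens and
# sample action tokens, e.g. with beam search biased toward high return.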

R3M: A Universal Visual Representation for Robot Manipulation

This work pre-trains a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations, resulting in R3M.
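
A simplified stand-in for the combined objective is sketched below. The cosine-similarity contrastive terms and the fixed weight are assumptions; R3M's actual recipe uses an InfoNCE-style time-contrastive loss and a learned video-language alignment score.

import torch
import torch.nn.functional as F

def r3m_style_loss(z_t, z_near, z_far, z_text, l1_weight=1e-5):
    # Time-contrastive term: frames close in time should embed closer
    # together than frames far apart in the same video.
    tcn = -F.cosine_similarity(z_t, z_near).mean() + F.cosine_similarity(z_t, z_far).mean()
    # Video-language alignment: frame embeddings should match the embedding
    # of the clip's language annotation.
    align = -F.cosine_similarity(z_t, z_text).mean()
    # L1 penalty encourages sparse, compact representations.
    sparsity = l1_weight * z_t.abs().sum(dim=-1).mean()
    return tcn + align + sparsity
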
...