Transformers for One-Shot Visual Imitation
Sudeep Dasari and Abhinav Kumar Gupta
Published in Conference on Robot Learning, 11 November 2020
Corpus ID: 226299546

Humans are able to seamlessly visually imitate others, by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate that into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, expanding these techniques to work with a… 

Manipulator-Independent Representations for Visual Imitation

This work presents a way to train manipulator-independent representations (MIR) that focus primarily on changes in the environment and have the characteristics needed for cross-embodiment visual imitation with RL: cross-domain alignment, temporal smoothness, and actionability.

Towards More Generalizable One-shot Visual Imitation Learning

This work proposes MOSAIC (Multi-task One-Shot Imitation with Self-Attention and Contrastive learning), which integrates a self-attention model architecture with a temporal contrastive module to enable better task disambiguation and more robust representation learning.
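Temporal contrastive modules of this kind are typically InfoNCE-style objectives that pull a frame's embedding toward temporally nearby frames and push it away from others. A minimal numpy sketch of that idea (the function name and toy embeddings are illustrative, not MOSAIC's actual code):

```python
import numpy as np

def temporal_infonce(query, keys, positive_idx, temperature=0.1):
    """InfoNCE-style temporal contrastive loss: pull the query frame's
    embedding toward a temporally nearby 'positive' key and push it
    away from the other (negative) keys."""
    logits = keys @ query / temperature      # similarity scores
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])

# Embeddings of three frames; index 0 is the temporally nearby frame.
query = np.array([1.0, 0.0])
keys = np.array([[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
loss = temporal_infonce(query, keys, positive_idx=0)
```

When the query already matches its temporal neighbor, the loss is near zero; choosing a distant frame as the "positive" yields a large loss.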

VIMA: General Robot Manipulation with Multimodal Prompts

This work designs a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively and achieves strong scalability in both model capacity and data size.

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

This work investigates PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation, and shows that it significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines across a wide range of tabletop tasks.

Meta-Imitation Learning by Watching Video Demonstrations

This work presents an approach to meta-imitation learning by watching video demonstrations from humans: it translates human videos into practical robot demonstrations and trains the meta-policy with an adaptive loss based on the quality of the translated data.

What Matters in Language Conditioned Robotic Imitation Learning Over Unstructured Data

An extensive study of the most critical challenges in learning language-conditioned policies from offline free-form imitation datasets is conducted, and a novel approach is presented that significantly outperforms the state of the art on the challenging language-conditioned, long-horizon robot manipulation benchmark CALVIN.

BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning

An interactive and flexible imitation learning system that can learn from both demonstrations and interventions, and can be conditioned on different forms of information that convey the task, including pretrained embeddings of natural language or videos of humans performing the task.

Learning with Dual Demonstration Domains: Random Domain-Adaptive Meta-Learning

This paper proposes a novel yet efficient Random Domain-Adaptive Meta-Learning (RDAML) framework to teach the robot to learn from multiple demonstration domains (e.g., human demonstrations + robot demonstrations) with different random sampling parameters.

Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation

This work proposes demonstration-conditioned reinforcement learning (DCRL), a novel approach to learning few-shot imitation agents, and shows that DCRL outperforms methods based on behaviour cloning on navigation tasks and on robotic manipulation tasks from the Meta-World benchmark.

From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data

This work presents Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification and demonstrates for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information.



One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning

This work presents an approach for one-shot learning from a video of a human: human and robot demonstration data from a variety of previous tasks builds up prior knowledge through meta-learning, and by combining this prior knowledge with a single video demonstration from a human, the robot can perform the task that the human demonstrated.

One-Shot Visual Imitation Learning via Meta-Learning

A meta-imitation learning method that enables a robot to learn how to learn more efficiently, allowing it to acquire new skills from just a single demonstration while requiring data from significantly fewer prior tasks.

Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller

A hierarchical setup is proposed in which a high-level module learns to generate a series of first-person sub-goals conditioned on the third-person video demonstration, and a low-level controller predicts the actions needed to achieve those sub-goals.

Grounding Language in Play

A simple and scalable way to condition policies on human language is presented, and a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning is introduced that significantly improves downstream robotic manipulation.

One-Shot Imitation Learning

A meta-learning framework for achieving one-shot imitation learning, in which, ideally, robots should be able to learn from very few demonstrations of any given task and instantly generalize to new situations of the same task, without requiring task-specific engineering.
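The meta-learning recipe behind one-shot imitation is often a MAML-style inner adaptation step: one behaviour-cloning gradient step on a single demonstration. A toy numpy sketch under that assumption (the linear policy and all names are illustrative, not the paper's architecture):

```python
import numpy as np

def inner_adapt(w, demo_obs, demo_act, lr=0.05):
    """One behaviour-cloning gradient step on a single demonstration,
    the 'inner loop' of a MAML-style one-shot imitation learner:
    adapt linear policy weights w so demo_obs @ w matches demo_act."""
    pred = demo_obs @ w
    grad = 2 * demo_obs.T @ (pred - demo_act) / len(demo_obs)  # MSE gradient
    return w - lr * grad

# A single toy demonstration generated by an unknown target policy.
demo_obs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
demo_act = demo_obs @ np.array([1.0, -2.0, 0.5])
w_init = np.zeros(3)
w_adapted = inner_adapt(w_init, demo_obs, demo_act)
```

The outer meta-training loop (not shown) would optimize the initialization `w_init` so that this single adaptation step works well across many tasks.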

Time-Contrastive Networks: Self-Supervised Learning from Video

A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed; the learned representation can be used by a robot to directly mimic human poses without an explicit correspondence, and can be used as a reward function within a reinforcement learning algorithm.
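The time-contrastive objective can be sketched as a triplet loss: embeddings of the same moment seen from two viewpoints attract, while a temporally distant frame from the same video repels. A minimal numpy illustration with toy embeddings (not the paper's network):

```python
import numpy as np

def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
    """Time-contrastive triplet loss: anchor and positive are embeddings
    of the same moment from two viewpoints; the negative is a temporally
    distant frame from the same video."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: co-temporal views are close, the distant frame is far.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
loss = tcn_triplet_loss(anchor, positive, negative)
```

With a well-separated negative the hinge is inactive (zero loss); a negative that drifts close to the anchor produces a positive loss.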

Concept2Robot: Learning manipulation concepts from instructions and human demonstrations

This work aims to endow a robot with the ability to learn manipulation concepts that link natural language instructions to motor skills by proposing a two-stage learning process where the robot first learns single-task policies through reinforcement learning and a multi-task policy through imitation learning.

AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

This paper takes an automated approach and performs pixel-level image translation via CycleGAN to convert the human demonstration into a video of a robot, which can then be used to construct a reward function for a model-based RL algorithm.
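The cycle-consistency idea behind CycleGAN-style translation can be shown with a toy L1 cycle loss: mapping a human frame to the robot domain and back should reconstruct the original. The translator functions below are stand-ins, not learned generators:

```python
import numpy as np

def cycle_consistency_loss(frames, human_to_robot, robot_to_human):
    """CycleGAN-style cycle loss: translating frames to the robot domain
    and back should recover the originals (L1 reconstruction penalty)."""
    reconstructed = robot_to_human(human_to_robot(frames))
    return np.mean(np.abs(frames - reconstructed))

# Toy 'translators': perfect inverses give zero cycle loss.
frames = np.array([[0.2, 0.8], [0.5, 0.1]])
loss_good = cycle_consistency_loss(frames, lambda x: 2 * x, lambda x: x / 2)
loss_bad = cycle_consistency_loss(frames, lambda x: 2 * x, lambda x: x)
```

In the real method this penalty is combined with adversarial losses so that the translated video both looks like the robot domain and stays faithful to the demonstrated motion.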

Unsupervised Perceptual Rewards for Imitation Learning

This work presents a method that can identify key intermediate steps of a task from only a handful of demonstration sequences, and automatically select the most discriminative features for recognizing those steps.

Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight

This work trains a model with both a visual and physical understanding of multi-object interactions and develops a sampling-based optimizer that leverages these interactions to accomplish tasks; it shows that the robot can perceive and use novel objects as tools, including objects that are not conventional tools, while dynamically choosing whether or not to use a tool depending on whether one is required.
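Sampling-based optimizers paired with learned visual-dynamics models are often variants of the cross-entropy method. A self-contained numpy sketch, where a simple analytic cost stands in for a learned video-prediction model:

```python
import numpy as np

def cem_plan(cost_fn, horizon, iters=8, pop=64, elite=8, seed=0):
    """Cross-entropy method: sample candidate action sequences, keep the
    lowest-cost elites, refit the sampling distribution to them, and
    return the final mean sequence as the plan."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, horizon))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Stand-in cost: distance of the action sequence from an ideal plan.
plan = cem_plan(lambda s: np.sum((s - 1.0) ** 2), horizon=3)
```

In visual foresight the cost would instead score imagined futures rolled out by the learned video model, but the optimizer loop is the same.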