Corpus ID: 244908821

CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks

@article{Mees2021CALVINAB,
  title={CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks},
  author={Oier Mees and Luk{\'a}s Hermann and Erick Rosete-Beas and Wolfram Burgard},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.03227}
}
General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned…
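
The benchmark targets policies that consume a camera observation together with a free-form instruction and emit low-level robot actions. The sketch below is purely illustrative of that interface and is not the CALVIN reference model; the module names, the 7-DoF action dimension, the 128x128 input size, and the 384-dimensional sentence-embedding input are all assumptions chosen for brevity.

```python
# Hypothetical sketch (not the CALVIN baseline): a minimal language-conditioned
# policy mapping an RGB observation and an instruction embedding to an action.
import torch
import torch.nn as nn


class LanguageConditionedPolicy(nn.Module):
    def __init__(self, lang_dim: int = 384, action_dim: int = 7):
        super().__init__()
        # Small CNN encoder for a 3x128x128 RGB observation (assumed size).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            vis_dim = self.vision(torch.zeros(1, 3, 128, 128)).shape[-1]
        # Fuse visual features with a precomputed sentence embedding of the
        # instruction (e.g. from any off-the-shelf sentence encoder).
        self.head = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # normalized action
        )

    def forward(self, rgb: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.vision(rgb), lang_emb], dim=-1)
        return self.head(feats)


# Usage with random tensors standing in for a camera frame and an embedding of
# an instruction such as "open the drawer".
policy = LanguageConditionedPolicy()
action = policy(torch.rand(1, 3, 128, 128), torch.rand(1, 384))
print(action.shape)  # torch.Size([1, 7])
```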

Citations

LISA: Learning Interpretable Skill Abstractions from Language (2022)
TLDR
This work proposes Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable skills from language-conditioned demonstrations that is able to outperform a strong non-hierarchical baseline in the low data regime and compose learned skills to solve tasks containing unseen long-range instructions.
Summarizing a virtual robot's past actions in natural language
TLDR
It is shown how a popular existing dataset that matches robot actions with natural language descriptions designed for an instruction following task can be repurposed to serve as a training ground for robot action summarization work.
What Matters in Language Conditioned Robotic Imitation Learning
TLDR
This paper conducts an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets and presents a novel approach that outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark.

References

SHOWING 1-10 OF 59 REFERENCES
Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations
TLDR
This work proposes a two-stage learning process where a single multi-task policy is learned that takes as input a natural language instruction and an image of the initial scene and outputs a robot motion trajectory to achieve the specified task.
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
TLDR
This work introduces a method for incorporating unstructured natural language into imitation learning and demonstrates in a set of simulation experiments how this approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compares the results to a variety of alternative methods.
Affordance Learning from Play for Sample-Efficient Policy Learning
TLDR
This work proposes a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning and demonstrates the effectiveness of visual affordances to guide model-based policies and closed-loop RL policies to learn robot manipulation tasks in the real world.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
TLDR
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings - the Room-to-Room (R2R) dataset and presents the Matterport3D Simulator - a large-scale reinforcement learning environment based on real imagery.
BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
TLDR
An interactive and flexible imitation learning system that can learn from both demonstrations and interventions and can be conditioned on different forms of information that convey the task, including pretrained embeddings of natural language or videos of humans performing the task.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
TLDR
A novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL), and a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions is introduced.
Composing Pick-and-Place Tasks By Grounding Language
TLDR
This work presents a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogues and demonstrates the effectiveness of the method in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks.
Grounded Language Learning in a Simulated 3D World
TLDR
An agent is presented that learns to interpret language in a simulated 3D environment where it is rewarded for the successful execution of written instructions and its comprehension of language extends beyond its prior experience, enabling it to apply familiar language to unfamiliar situations and to interpret entirely novel instructions.
CLIPort: What and Where Pathways for Robotic Manipulation
TLDR
CLIPORT is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
TLDR
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.