ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

@article{Shridhar2020ALFRED,
  title={ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks},
  author={Mohit Shridhar and Jesse Thomason and Daniel Gordon and Yonatan Bisk and Winson Han and Roozbeh Mottaghi and Luke Zettlemoyer and Dieter Fox},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020}
}
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives… 
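
The core task ALFRED evaluates, mapping a natural language directive (plus egocentric observations) to a sequence of low-level actions, can be sketched as a minimal interface. All names below are hypothetical illustrations, not ALFRED's actual API, and the keyword matching is a toy stand-in for a learned model:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    action: str            # e.g. "PickupObject", "PutObject"
    target: Optional[str]  # object the action manipulates, if any

def follow_directive(directive: str) -> List[Step]:
    """Toy keyword-matching stand-in for an instruction follower.
    Real ALFRED models condition on egocentric vision and action
    history at every step; this only illustrates the I/O contract:
    language in, grounded action sequence out."""
    plan: List[Step] = []
    text = directive.lower()
    if "mug" in text:
        plan.append(Step("PickupObject", "Mug"))
    if "microwave" in text:
        plan.append(Step("PutObject", "Microwave"))
    return plan
```

The non-reversible state changes mentioned in the abstract (e.g. heating or slicing an object) are what make early mistakes in such a plan unrecoverable.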


ALFRED-L: Investigating the Role of Language for Action Learning in Interactive Visual Environments

Evidence is found that sequence-to-sequence and transformer-based models trained on this benchmark are not sufficiently sensitive to changes in input language instructions, and that models trained with additional augmented trajectories adapt comparatively better to such changes.

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

ALFWorld, a simulator that lets agents learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment, enables a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.

On Grounded Planning for Embodied Tasks with Language Models

This paper asks whether language models can generate grounded, executable plans for embodied tasks, and demonstrates that encoding the environment as tables and using an iterative decoding strategy significantly enhance the models' grounded-planning ability.

Visual Semantic Planning for Service Robot via Natural Language Instructions

This work uses natural language instructions to build a single-mode model and frames the task as a sequential decision problem, focusing on generating continuous high-level action sequences directly from the instructions.

ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments

This work uses the AI2-THOR simulated environment to produce a controlled setup in which an agent has to determine what the correct after-image is among a set of possible candidates, and suggests that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task.

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

Embodied BERT (EmBERT) is presented, a transformer-based model that can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion, bridging the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED.


This is the first experimentally testable model of how language can structure sensorimotor representations to allow for task compositionality, and it is found that the resulting neural representations capture the semantic structure of interrelated tasks even for novel tasks.

FILM: Following Instructions in Language with Modular Methods

The findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.

MOCA: A Modular Object-Centric Approach for Interactive Instruction Following

This work proposes MOCA, a Modular Object-Centric Approach whose architecture decouples the task into visual perception and action policy, and empirically validates that it outperforms prior art by significant margins on all metrics with good generalization performance.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
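
The semantic-translation step above maps each free-form plan step the language model generates to the most similar admissible action. The paper scores similarity with sentence embeddings; the sketch below swaps in a toy token-overlap (Jaccard) score so it stays dependency-free, and the action names are illustrative:

```python
# Map a free-form LM plan step to the closest admissible environment action.
# Token-overlap similarity is a stand-in for the embedding cosine similarity
# used in the actual method.

def jaccard(a: str, b: str) -> float:
    """Fraction of shared tokens between two phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def to_admissible(step: str, admissible: list) -> str:
    """Pick the admissible action most similar to the generated step."""
    return max(admissible, key=lambda act: jaccard(step, act))

actions = ["walk to kitchen", "pick up mug", "open fridge"]
print(to_admissible("grab the mug", actions))  # -> pick up mug
```

Snapping every generated step onto the admissible set is what keeps an otherwise free-form plan executable by the agent.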

Gated-Attention Architectures for Task-Oriented Language Grounding

An end-to-end trainable neural architecture is presented for task-oriented language grounding in 3D environments that assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input.
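
The gating idea behind this architecture is a per-channel gate derived from the instruction embedding that multiplicatively modulates each visual feature channel. The sketch below is a minimal illustration in plain Python; shapes, the sigmoid gate, and names are illustrative, not the paper's exact model:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention(feature_maps, instruction_embedding):
    """Gate visual features with a language-derived gate.

    feature_maps: list of C channels, each an HxW grid (nested lists).
    instruction_embedding: length-C vector, one gate value per channel.
    Each channel is scaled elementwise by its sigmoid-squashed gate,
    so the instruction decides which visual channels to amplify."""
    gates = [sigmoid(v) for v in instruction_embedding]
    return [[[cell * g for cell in row] for row in channel]
            for channel, g in zip(feature_maps, gates)]
```

The multiplicative interaction lets the same convolutional features serve different instructions: "go to the green pillar" and "go to the red pillar" simply open different channel gates.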

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Speaker-Follower Models for Vision-and-Language Navigation

Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions

This paper presents a model that takes into account the variations in natural language and ambiguities in grounding them to robotic instructions with appropriate environment context and task constraints, based on an energy function that encodes such properties in a form isomorphic to a conditional random field.

Towards a Dataset for Human Computer Communication via Grounded Language Acquisition

This paper discusses the work towards building a dataset that enables an empirical approach to studying the relation between natural language, actions, and plans; and introduces a problem formulation that allows us to take meaningful steps towards addressing the open problems listed above.

Grounding Robot Plans from Natural Language Instructions with Incomplete World Knowledge

A probabilistic model is introduced that uses background knowledge to infer latent or missing plan constituents from semantic co-associations learned from noisy textual corpora of task descriptions, enabling robots to interpret and execute high-level tasks conveyed through natural language instructions.

Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions

MARCO, an agent that follows free-form, natural language route instructions by representing and executing a sequence of compound action specifications that model which actions to take under which conditions, is presented.
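
A compound action specification in this sense pairs an action with the condition under which it applies, e.g. "go forward until you reach the chair". A minimal executor for such specs, with a toy one-dimensional world whose names are all hypothetical, could look like:

```python
class Hallway:
    """Toy world: agent at an integer position, chair fixed at position 3."""
    def __init__(self):
        self.pos = 0
    def observe(self):
        return {"at_chair": self.pos == 3}
    def apply(self, action):
        if action == "forward":
            self.pos += 1

def follow_route(specs, world):
    """specs: list of (action, stop_condition) pairs.
    Each action repeats until its stop condition holds in the current
    observation, mirroring conditioned steps like 'go forward until
    you reach the chair'. Returns the trace of actions taken."""
    trace = []
    for action, stop in specs:
        while not stop(world.observe()):
            world.apply(action)
            trace.append(action)
    return trace
```

Modeling the route as condition-action pairs rather than a fixed action string is what lets the agent adapt the same instruction to hallways of different lengths.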

Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning

This work presents “Help, Anna!” (HANNA), an interactive photo-realistic simulator in which an agent fulfills object-finding tasks by requesting and interpreting natural language-and-vision assistance, together with an imitation learning algorithm that teaches the agent to avoid repeating past mistakes while simultaneously predicting its own chances of making future progress.

Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions

This paper shows that a grounded CCG semantic-parsing approach can learn a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision.

Neural Modular Control for Embodied Question Answering

This work uses imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning, to learn navigation policies over long planning horizons from language input.