ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

@article{Shridhar2020ALFREDAB,
  title={ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks},
  author={Mohit Shridhar and Jesse Thomason and Daniel Gordon and Yonatan Bisk and Winson Han and Roozbeh Mottaghi and Luke Zettlemoyer and Dieter Fox},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={10737-10746}
}
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives… 
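To make the task setup concrete, here is a minimal Python sketch of the interface the abstract describes: an agent receives a goal directive plus step-by-step instructions and, at each timestep, an egocentric RGB frame, and must emit the next action. The class names, fields, and action names shown here are illustrative assumptions, not the benchmark's actual API.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Directive:
    goal: str          # high-level goal, e.g. "Put a chilled apple on the table." (illustrative)
    steps: List[str]   # step-by-step natural language instructions

@dataclass
class Action:
    name: str                                 # e.g. "MoveAhead" or "PickupObject" (illustrative names)
    interaction_mask: Optional[list] = None   # pixel mask for object-interaction actions, else None

class InstructionFollowingAgent:
    """Maps a language directive plus egocentric vision to a sequence of actions."""

    def reset(self, directive: Directive) -> None:
        self.directive = directive

    def act(self, rgb_frame) -> Action:
        # A trained model would condition on self.directive and rgb_frame;
        # this placeholder simply terminates the episode.
        return Action(name="Stop")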


ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
TLDR
ALFWorld, a simulator that lets agents learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment, enables the creation of a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.
Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
TLDR
This work empirically demonstrates that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases, and suggests contextualized language models may provide strong planning modules for grounded virtual agents.
A neural model of task compositionality with natural language instructions
TLDR
This is the first experimentally testable model of how language can structure sensorimotor representations to allow for task compositionality, and it is found that the resulting neural representations capture the semantic structure of interrelated tasks even for novel tasks.
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
TLDR
Embodied BERT (EmBERT) is presented, a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion and bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED.
FILM: Following Instructions in Language with Modular Methods
TLDR
The findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.
MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
TLDR
This work proposes MOCA, a Modular Object-Centric Approach: a modular architecture that decouples the task into visual perception and action policy, and empirically validates that it outperforms prior art by significant margins on all metrics with good generalization performance.
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
TLDR
This paper investigates the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps and proposes a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks
TLDR
A new method that outperforms previous methods by a large margin, based on a combination of several new ideas: it considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction.
LEBP - Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents
TLDR
LEBP, a Language Expectation and Binding Policy module, is proposed; it achieves performance comparable to currently published SOTA methods and avoids large performance decay from seen to unseen scenarios.
Are We There Yet? Learning to Localize in Embodied Instruction Following
TLDR
This study augments the agent's field of view during navigation subgoals with multiple viewing angles, trains the agent to predict its relative spatial relation to the target location at each timestep, and improves language grounding by introducing a pre-trained object detection module into the model pipeline.
...

References

Showing 1-10 of 62 references
Gated-Attention Architectures for Task-Oriented Language Grounding
TLDR
An end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
TLDR
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
Speaker-Follower Models for Vision-and-Language Navigation
TLDR
Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions
TLDR
This paper presents a model that takes into account the variations in natural language and ambiguities in grounding them to robotic instructions with appropriate environment context and task constraints, based on an energy function that encodes such properties in a form isomorphic to a conditional random field.
Towards a Dataset for Human Computer Communication via Grounded Language Acquisition
TLDR
This paper discusses work towards building a dataset that enables an empirical approach to studying the relation between natural language, actions, and plans, and introduces a problem formulation that allows meaningful steps towards addressing these open problems.
Grounding Robot Plans from Natural Language Instructions with Incomplete World Knowledge
TLDR
A probabilistic model is introduced that utilizes background knowledge to infer latent or missing plan constituents based on semantic co-associations learned from noisy textual corpora of task descriptions to enable robots to interpret and execute high-level tasks conveyed using natural language instructions.
Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions
TLDR
MARCO, an agent that follows free-form, natural language route instructions by representing and executing a sequence of compound action specifications that model which actions to take under which conditions, is presented.
Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
TLDR
“Help, Anna!” (HANNA), an interactive photo-realistic simulator in which an agent fulfills object-finding tasks by requesting and interpreting natural language-and-vision assistance, and an imitation learning algorithm that teaches the agent to avoid repeating past mistakes while simultaneously predicting its own chances of making future progress.
Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions
TLDR
This paper shows that a grounded CCG semantic parsing approach can learn a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision.
Neural Modular Control for Embodied Question Answering
TLDR
This work uses imitation learning to warm-start policies at each level of the hierarchy, dramatically increasing sample efficiency, followed by reinforcement learning, to learn policies for navigation over long planning horizons from language input.
...