Factorizing Perception and Policy for Interactive Instruction Following

@inproceedings{singh2021factorizing,
  title={Factorizing Perception and Policy for Interactive Instruction Following},
  author={Kunal Pratap Singh and Suvaansh Bhambri and Byeonghwi Kim and Roozbeh Mottaghi and Jonghyun Choi},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}
Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for AI agents. The 'interactive instruction following' task aims to build agents that jointly navigate, interact, and reason in the environment at every step. To address this multifaceted problem, the authors propose a model that factorizes the task into interactive perception and action policy streams with enhanced components and name it MOCA, a…
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
A model-agnostic milestone-based task tracker (M-TRACK) guides the agent and monitors its progress, and a milestone checker systematically checks the agent's progress on its current milestone and determines when to proceed to the next.
SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following
Benchmarking evaluates the impact of different components of, and options for, the vision-and-language learning model, demonstrating the effectiveness of pretraining strategies and the robustness of the framework to novel scenarios.
A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search
Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input.
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
ALFWorld is a simulator that lets agents learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment; it enables the creation of a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
This work learns a single model to jointly reason about linguistic and visual input in a contextual bandit setting to train a neural network agent and shows significant improvements over supervised learning and common reinforcement learning variants.
Gated-Attention Architectures for Task-Oriented Language Grounding
An end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input.
Embodied Question Answering
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
Rearrangement: A Challenge for Embodied AI
A framework for research and evaluation in Embodied AI is described, based on a canonical task: Rearrangement, that can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings.
On Evaluation of Embodied Navigation Agents
The present document summarizes the consensus recommendations of a working group to study empirical methodology in navigation research and discusses different problem statements and the role of generalization, present evaluation measures, and provides standard scenarios that can be used for benchmarking.
Speaker-Follower Models for Vision-and-Language Navigation
Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions
MARCO, an agent that follows free-form, natural language route instructions by representing and executing a sequence of compound action specifications that model which actions to take under which conditions, is presented.
Robust Navigation with Language Pretraining and Stochastic Sampling
This paper adapts large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions and proposes a stochastic sampling scheme to reduce the considerable gap between the expert actions in training and sampled actions in test.