• Corpus ID: 235497558

Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following

Authors: Byeonghwi Kim, Suvaansh Bhambri, Kunal Pratap Singh
We address the interactive instruction following task [4, 9, 8] which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks, following natural language instructions with egocentric vision. To successfully achieve a goal in the interactive instruction following task, the agent should infer a sequence of actions and object interactions. When performing actions, a small field of view often limits the agent’s understanding of an environment… 

Learning to Act with Affordance-Aware Multimodal Neural SLAM
This work proposes a Neural SLAM approach that utilizes several modalities for exploration, predicts an affordance-aware semantic map, and plans over it at the same time; it significantly improves exploration efficiency, leads to robust long-horizon planning, and enables effective vision-and-language grounding.
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Embodied BERT (EmBERT) is presented, a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion and bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED.
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
This work proposes a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress, and a milestone checker that systematically checks the agent’s progress in its current milestone and determines when to proceed to the next.
FILM: Following Instructions in Language with Modular Methods
The findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.
LEBP - Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents
LEBP (Language Expectation with Binding Policy Module) is proposed, which achieves performance comparable to currently published SOTA methods and can avoid large decay from seen scenarios to unseen scenarios.
On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets
It is observed that augmenting a transformer model for this task with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action results in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED.
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
A persistent spatial semantic representation method is proposed that enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks, despite completely avoiding the commonly used step-by-step instructions.
TEACh: Task-driven Embodied Agents that Chat
TEACh, a dataset of over 3,000 human–human, interactive dialogues to complete household tasks in simulation, is introduced and initial models’ abilities in dialogue understanding, language grounding, and task execution are evaluated.
Skill Induction and Planning with Latent Language
A framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making, achieves performance comparable to state-of-the-art models on ALFRED success rate, outperforming several recent methods with access to ground-truth plans.
Are you doing what I say? On modalities alignment in ALFRED
This work introduces approaches aimed at improving model alignment and demonstrates how improved alignment improves end task performance.


Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks
A new method is presented that outperforms previous methods by a large margin. It combines several new ideas, considers multiple egocentric views of the environment, and extracts essential information by applying hierarchical attention conditioned on the current instruction.
IQA: Visual Question Answering in Interactive Environments
The Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction, is proposed, and outperforms popular single controller based methods on IQUAD V1.
Visual Semantic Planning Using Deep Successor Representations
This work addresses the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.
Learning Object Relation Graph and Tentative Policy for Visual Navigation
Three complementary techniques are proposed: an object relation graph (ORG), which improves visual representation learning by integrating object relationships, including category closeness and spatial correlations; trial-driven imitation learning (IL); and a memory-augmented tentative policy network (TPN).
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
On Evaluation of Embodied Navigation Agents
The present document summarizes the consensus recommendations of a working group to study empirical methodology in navigation research. It discusses different problem statements and the role of generalization, presents evaluation measures, and provides standard scenarios that can be used for benchmarking.
AI2-THOR: An Interactive 3D Environment for Visual AI
AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks and facilitate building visually intelligent models.
AutoAugment: Learning Augmentation Policies from Data
This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.