• Corpus ID: 220495729

Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

Zhiwei Deng, Karthik Narasimhan, Olga Russakovsky

The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the… 

Structured Scene Memory for Vision-Language Navigation

This work proposes Structured Scene Memory (SSM), an architecture compartmentalized enough to accurately memorize percepts during navigation. It serves as a structured scene representation that captures and disentangles visual and geometric cues in the environment.

SOON: Scenario Oriented Object Navigation with Graph-based Exploration

A novel graph-based exploration (GBE) method is proposed that outperforms various state-of-the-art methods on both the FAO and R2R datasets, and the ablation studies on FAO validate the quality of the dataset.

Topological Planning with Transformers for Vision-and-Language Navigation

This work proposes a modular approach to VLN using topological maps that leverages attention mechanisms to predict a navigation plan in the map; it generates interpretable navigation plans and exhibits intelligent behaviors such as backtracking.

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON and improves the success rate on the fine-grained VLN benchmark R2R.

Collaborative Visual Navigation

This work proposes a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN), and proposes a memory-augmented communication framework that allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning.

Reinforced Structured State-Evolution for Vision-Language Navigation

A novel Structured state-Evolution (SEvol) model is proposed to effectively maintain environment layout clues for VLN. Its Structured Evolving Module (SEM) maintains the structured graph-based state during navigation, where the state is gradually evolved to learn object-level spatial-temporal relationships.

Rethinking the Spatial Route Prior in Vision-and-Language Navigation

This work addresses the task of VLN from a previously ignored aspect, namely the spatial route prior of the navigation scenes, and proposes a sequential-decision variant and an explore-and-exploit scheme that curates a compact and informative sub-graph to exploit.

History Aware Multimodal Transformer for Vision-and-Language Navigation

A History Aware Multimodal Transformer (HAMT) is introduced to incorporate a long-horizon history into multimodal decision making for vision-and-language navigation and achieves new state of the art on a broad range of VLN tasks.

Multimodal attention networks for low-level vision-and-language navigation

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

To bridge the discrete-to-continuous gap, a predictor is proposed to generate a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments.



Cognitive Mapping and Planning for Visual Navigation

The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and a spatial memory with the ability to plan given an incomplete set of observations about the world.

The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation

This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, and proposes two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods using greedy action selection.

Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation

This work finds that shortest-path sampling, which is used by both the R2R benchmark and existing augmentation methods, encodes biases in the agent's action space, which the authors dub action priors, and proposes a path sampling method based on random walks to augment the data.

Multi-View Learning for Vision-and-Language Navigation

A novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve generalization is presented.

Vision-and-Dialog Navigation

This work introduces Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments and establishes an initial, multi-modal sequence-to-sequence model.

Learning to Navigate in Cities Without a Map

This work proposes a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities, and presents an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage.

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

A self-monitoring agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Semi-parametric Topological Memory for Navigation

A new memory architecture for navigation in previously unseen environments, inspired by landmark-based navigation in animals, that consists of a (non-parametric) graph with nodes corresponding to locations in the environment and a deep network capable of retrieving nodes from the graph based on observations.

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments.