Corpus ID: 220495729

Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

Zhiwei Deng, Karthik Narasimhan, Olga Russakovsky
The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the… 
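The global-planning idea in the abstract can be illustrated with a small sketch (all names and scores here are hypothetical, not the paper's actual model): the agent keeps a growing graph of visited locations and unexplored frontier nodes, scores every frontier node against the instruction, and reaches the globally best one by backtracking along known edges.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS over an undirected adjacency dict; returns a node list or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def plan_to_best_frontier(graph, current, frontier_scores):
    """Pick the globally best-scoring frontier node and the route to it."""
    target = max(frontier_scores, key=frontier_scores.get)
    return target, shortest_path(graph, current, target)

# Toy rollout: the agent stands at 'c', but node 'f2', discovered earlier,
# matches the instruction best, so the plan backtracks through 'b' and 'a'.
graph = {'a': ['b', 'f2'], 'b': ['a', 'c'], 'c': ['b', 'f1'],
         'f1': ['c'], 'f2': ['a']}
target, route = plan_to_best_frontier(graph, 'c', {'f1': 0.2, 'f2': 0.9})
```

In a real agent the frontier scores would come from a learned instruction-conditioned model; the point of the sketch is only that planning over the whole graph, rather than over immediate neighbors, makes error correction a path query.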


Structured Scene Memory for Vision-Language Navigation

This work proposes Structured Scene Memory (SSM), an architecture compartmentalized enough to accurately memorize percepts during navigation; it serves as a structured scene representation that captures and disentangles visual and geometric cues in the environment.

Target-Driven Structured Transformer Planner for Vision-Language Navigation

This article devises an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when located in unexplored environments) and designs a Structured Transformer Planner that elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.

SOON: Scenario Oriented Object Navigation with Graph-based Exploration

A novel graph-based exploration (GBE) method is proposed that outperforms various state-of-the-art methods on both the FAO and R2R datasets; ablation studies on FAO validate the quality of the dataset.

Topological Planning with Transformers for Vision-and-Language Navigation

This work proposes a modular approach to VLN using topological maps that leverages attention mechanisms to predict a navigation plan in the map, and generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.

Reinforced Structured State-Evolution for Vision-Language Navigation

This paper proposes a novel Structured state-Evolution (SEvol) model, which utilises graph-based features to represent the navigation state instead of a vector-based state, and devises a Reinforced Layout clues Miner (RLM) to mine and detect the most crucial layout graph for long-term navigation via a customised reinforcement learning strategy.

Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

A hybrid transformer-recurrence model is presented which creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language modalities to enable effective learning of VLN policy.

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

This work proposes a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding, and builds a topological map on-the-fly to enable efficient exploration in global action space.

Collaborative Visual Navigation

This work proposes a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN), and proposes a memory-augmented communication framework that allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning.

ULN: Towards Underspecified Vision-and-Language Navigation

A VLN framework consisting of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module is proposed; it is more robust and outperforms the baselines on ULN by a 10% relative success rate across all levels.

Rethinking the Spatial Route Prior in Vision-and-Language Navigation

This work addresses the task of VLN from a previously ignored aspect, namely the spatial route prior of the navigation scenes, and proposes a sequential-decision variant and an explore-and-exploit scheme that curates a compact and informative sub-graph to exploit.

Cognitive Mapping and Planning for Visual Navigation

The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, such that mapping is driven by the needs of the task, and on a spatial memory with the ability to plan given an incomplete set of observations about the world.

The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation

This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, and proposes two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods using greedy action selection.
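The idea of a progress monitor as a search heuristic can be sketched as best-first search over frontier viewpoints (a minimal illustration with a plain score dict standing in for the paper's learned monitor; all names are hypothetical):

```python
import heapq

def heuristic_search(neighbors, progress, start, goal, max_steps=50):
    """Best-first search: always expand the frontier viewpoint with the
    highest estimated progress. Popping a node that was skipped earlier
    is the 'regret'/backtracking behavior the paper describes."""
    frontier = [(-progress[start], start, [start])]
    visited = set()
    for _ in range(max_steps):
        if not frontier:
            break
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors.get(node, ()):
            if nxt not in visited:
                heapq.heappush(frontier, (-progress[nxt], nxt, path + [nxt]))
    return None

# Toy map: greedy selection would chase high-progress 'x' into a dead end;
# best-first search returns to 'y' on the frontier and reaches 'g'.
neighbors = {'s': ['x', 'y'], 'x': ['dead'], 'y': ['g'], 'dead': [], 'g': []}
progress = {'s': 0.1, 'x': 0.8, 'y': 0.6, 'dead': 0.2, 'g': 1.0}
route = heuristic_search(neighbors, progress, 's', 'g')
```

Greedy action selection commits to the locally best neighbor; keeping the whole frontier in a priority queue is what lets the agent recover once the heuristic stops improving.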

Multi-View Learning for Vision-and-Language Navigation

A novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve generalization is presented.

Vision-and-Dialog Navigation

This work introduces Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments and establishes an initial, multi-modal sequence-to-sequence model.

Learning to Navigate in Cities Without a Map

This work proposes a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities, and presents an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage.

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

A self-monitoring agent is proposed with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images; and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Semi-parametric Topological Memory for Navigation

A new memory architecture for navigation in previously unseen environments, inspired by landmark-based navigation in animals, that consists of a (non-parametric) graph with nodes corresponding to locations in the environment and a deep network capable of retrieving nodes from the graph based on observations.
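A minimal, stdlib-only sketch of that semi-parametric idea (the learned retrieval network is replaced here by cosine similarity, and the embeddings are made up for illustration): observations become graph nodes, the current observation is localized to its nearest stored node, and the next waypoint is the first step on the shortest path toward the goal node.

```python
import math
from collections import deque

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def localize(memory, obs):
    """Stand-in for the retrieval network: nearest stored node by cosine."""
    return max(memory, key=lambda n: cosine(memory[n], obs))

def next_waypoint(edges, start, goal):
    """First step on the BFS shortest path from start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path[1] if len(path) > 1 else start
        for nxt in edges.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Landmark memory: node -> embedding; edges come from temporal adjacency
# during exploration, as in landmark-based navigation.
memory = {0: (1.0, 0.0), 1: (0.7, 0.7), 2: (0.0, 1.0)}
edges = {0: [1], 1: [0, 2], 2: [1]}
node = localize(memory, (0.9, 0.1))   # current view resembles node 0
step = next_waypoint(edges, node, 2)  # so head toward node 1 first
```

The graph itself stays non-parametric (it is just stored observations and adjacency); only the retrieval function is learned in the actual architecture.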

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments.