Sub-Instruction Aware Vision-and-Language Navigation

Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, Stephen Gould
Vision-and-language navigation requires an agent to navigate through a real 3D environment following a given natural-language instruction. Despite significant advances, few previous works fully exploit the strong correspondence between the visual and textual sequences. Meanwhile, due to the lack of intermediate supervision, the agent's performance at following each part of the instruction cannot be tracked during navigation. In this work, we focus on the granularity of the visual…

Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information

This paper proposes a navigation agent that utilizes syntactic information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes, and achieves the new state of the art on the Room-Across-Room dataset.

Structured Scene Memory for Vision-Language Navigation

This work proposes Structured Scene Memory (SSM), an architecture expressive enough to accurately memorize percepts during navigation; it serves as a structured scene representation that captures and disentangles visual and geometric cues in the environment.

VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation

A recurrent, time-aware BERT model for VLN is proposed that can replace more complex encoder-decoder models, achieves state-of-the-art results, and can be generalised to other transformer-based architectures.

Language and Visual Entity Relationship Graph for Agent Navigation

A novel Language and Visual Entity Relationship Graph is proposed for modelling the inter-modal relationships between text and vision and the intra-modal relationships among visual entities, together with a message-passing algorithm for propagating information between language elements and visual entities in the graph.

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.

How Faithful Are Current Vision Language Navigation Models: A Study on R2R and R4R Datasets

The Speaker-Follower model and the environmental-dropout model are combined with the Room-for-Room (R4R) data-augmentation technique, yielding improved generalization, i.e., a higher success rate in unseen environments.

Emerging Trends of Multimodal Research in Vision and Language

A detailed overview of the latest trends in research on the visual and language modalities is presented, covering their applications, task formulations, and approaches to problems in semantic perception and content generation.

Deep Learning for Embodied Vision Navigation: A Survey

This paper presents a comprehensive review of embodied navigation tasks and recent progress in deep-learning-based methods, covering two major tasks: target-oriented navigation and instruction-oriented navigation.

Vision-Language Navigation with Random Environmental Mixup

The experimental results on benchmark datasets demonstrate that the augmented data generated via REM help the agent narrow the performance gap between seen and unseen environments and improve overall performance, making the model the best existing approach on the standard VLN benchmark.

Towards Navigation by Reasoning over Spatial Configurations

This work proposes a neural agent that uses the elements of spatial configurations, investigates their influence on the navigation agent's reasoning ability, and models the sequential execution order while aligning visual objects with the spatial configurations in the instruction.

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

This paper proposes to exploit dynamic convolutional filters to efficiently encode the visual information and the linguistic description into a series of low-level, agent-friendly actions, and categorizes recent work on VLN according to architectural choices.

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

A self-monitoring agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress.

Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

This work highlights shortcomings of current metrics for the Room-to-Room dataset and proposes a new metric, Coverage weighted by Length Score (CLS), and shows that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.

Transferable Representation Learning in Vision-and-Language Navigation

This approach adapts pre-trained vision and language representations to relevant in-domain tasks making them more effective for VLN, and improves competitive agents in R2R as measured by the success rate weighted by path length (SPL) metric.
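The SPL metric mentioned above (success rate weighted by path length) rewards agents that succeed while staying close to the shortest path. A minimal sketch of the standard SPL computation, written here for illustration (the function name and argument layout are my own, not from the paper):

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        list of 0/1 flags, one per episode
    shortest_lengths: shortest-path distance from start to goal per episode
    path_lengths:     length of the path the agent actually took per episode
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        # A successful episode contributes l / max(p, l); failures contribute 0.
        total += s * l / max(p, l)
    return total / len(successes)

# An agent that succeeds via the shortest path scores 1.0 for that episode;
# succeeding with twice the shortest path scores 0.5; failing scores 0.
print(spl([1, 1, 0], [10.0, 10.0, 8.0], [10.0, 20.0, 12.0]))  # → 0.5
```

Because the episode weight is capped at 1 (via `max(p, l)`), SPL never rewards an agent for taking a path shorter than the reference shortest path, which can happen on graphs with approximate distances.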

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire semantic representations in order to reason about its activities and build a thorough perception of environments.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement-learning environment based on real imagery.

Speaker-Follower Models for Vision-and-Language Navigation

Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

It is proposed to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.

The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation

This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, and proposes two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods that use greedy action selection.