Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang. Conference on Empirical Methods in Natural Language Processing.

In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment by following natural language instructions. A challenge in this task is how to handle ‘off the path’ scenarios, where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work…

Cross-modal Map Learning for Vision and Language Navigation

A cross-modal map learning model for vision-and-language navigation is proposed that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints.

Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments

A new environment representation for Vision-and-Language Navigation in Continuous Environments (VLN-CE) is proposed, incorporating an Environment Representation Graph (ERG) built through object detection to express the environment at the semantic level, together with a new cross-modal attention navigation framework.

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language…

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

A multi-granularity map, which contains both object-grained details and semantic classes, is proposed to represent objects more comprehensively, along with a weakly-supervised auxiliary task that requires the agent to localize instruction-relevant objects on the map.

Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

This work explores the gap in performance between the standard VLN setting, built on topological environments where navigation is abstracted away, and the VLN-CE setting, where agents must navigate continuous 3D environments using low-level actions, and demonstrates the potential of this direction.

Learning Disentanglement with Decoupled Labels for Vision-Language Navigation

This work proposes a new Disentanglement framework with Decoupled Labels (DDL) for VLN and designs a Disentangled Decoding Module to guide discriminative feature extraction and aid cross-modal alignment.

MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation

This work designs a Fast Sub-instruction Algorithm (FSA) to segment raw instructions into sub-instructions and generates a new sub-instruction dataset named "FSASub"; the algorithm is annotation-free and 70 times faster than the current method, thus fitting the real-time requirement of continuous VLN.

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

This work proposes to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D and fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation.

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Through extensive experiments, it is shown that agents navigating continuous environments with predicted waypoints perform significantly better than agents using low-level actions, reducing the absolute discrete-to-continuous performance gap by 11.76%.

Human-in-the-loop Robotic Grasping Using BERT Scene Representation

A human-in-the-loop framework for robotic grasping in cluttered scenes is presented, investigating a language interface to the grasping process that allows the user to intervene with natural language commands.

Sub-Instruction Aware Vision-and-Language Navigation

This work provides agents with fine-grained annotations during training and finds that they are able to follow the instruction better and have a higher chance of reaching the target at test time.

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

A language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions is developed, suggesting that performance in prior ‘navigation-graph’ settings may be inflated by the strong implicit assumptions.

Topological Planning with Transformers for Vision-and-Language Navigation

This work proposes a modular approach to VLN using topological maps that leverages attention mechanisms to predict a navigation plan in the map; it generates interpretable navigation plans and exhibits intelligent behaviors such as backtracking.

Object-and-Action Aware Model for Visual Language Navigation

An Object-and-Action Aware Model (OAAM) is proposed that processes these two different forms of natural language instruction separately, enabling each process to flexibly match object-centered/action-centered instructions to its own counterpart visual perception/action orientation.

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.

Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

This work highlights shortcomings of current metrics for the Room-to-Room dataset and proposes a new metric, Coverage weighted by Length Score (CLS), and shows that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.

Speaker-Follower Models for Vision-and-Language Navigation

Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

Waypoint Models for Instruction-guided Navigation in Continuous Environments

A class of language-conditioned waypoint prediction networks is developed to examine the role of action spaces in language-guided visual navigation; it finds that more expressive models result in simpler, faster-to-execute trajectories, but lower-level actions can achieve better navigation metrics by better approximating shortest paths.

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

It is proposed to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.