Waypoint Models for Instruction-guided Navigation in Continuous Environments

@article{Krantz2021WaypointMF,
  title={Waypoint Models for Instruction-guided Navigation in Continuous Environments},
  author={Jacob Krantz and Aaron Gokaslan and Dhruv Batra and Stefan Lee and Oleksandr Maksymets},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={15142-15151}
}
Little inquiry has explicitly addressed the role of action spaces in language-guided visual navigation – either in terms of their effect on navigation success or the efficiency with which a robotic agent could execute the resulting trajectory. Building on the recently released VLN-CE [24] setting for instruction following in continuous environments, we develop a class of language-conditioned waypoint prediction networks to examine this question. We vary the expressivity of these models to explore…
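
As a rough illustration of what a language-conditioned waypoint predictor can look like, the sketch below fuses an instruction encoding with a visual encoding and scores a grid of relative waypoints (heading × distance bins). The module names, dimensions, and fusion scheme are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class WaypointPredictor(nn.Module):
    """Minimal sketch of a language-conditioned waypoint predictor.

    Predicts a categorical distribution over relative waypoints,
    discretized into heading and distance bins around the agent.
    All dimensions and the fusion scheme are illustrative assumptions.
    """

    def __init__(self, instr_dim=256, vis_dim=512,
                 num_headings=12, num_distances=4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(instr_dim + vis_dim, 256), nn.ReLU(),
            nn.Linear(256, num_headings * num_distances),
        )
        self.num_headings = num_headings
        self.num_distances = num_distances

    def forward(self, instr_feat, vis_feat):
        logits = self.fuse(torch.cat([instr_feat, vis_feat], dim=-1))
        # Scores over every (heading bin, distance bin) pair.
        return logits.view(-1, self.num_headings, self.num_distances)

# Usage: sample a waypoint, then hand it to a low-level controller.
model = WaypointPredictor()
logits = model(torch.randn(1, 256), torch.randn(1, 512))
dist = torch.distributions.Categorical(logits=logits.flatten(1))
idx = dist.sample()  # index into the heading x distance grid
```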

Citations

1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)

The methods of the winning entry of the RxR-Habitat Competition at CVPR 2022 are presented, which apply several recent advances in Vision-and-Language Navigation to improve performance, such as pretraining on a large-scale synthetic in-domain dataset, environment-level data augmentation, and snapshot model ensembling.

Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

This work explores the gap in performance between the standard VLN setting, built on topological environments where navigation is abstracted away, and the VLN-CE setting, where agents must navigate continuous 3D environments using low-level actions, and demonstrates the potential of this direction.

Target-Driven Structured Transformer Planner for Vision-Language Navigation

This article devises an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when located in unexplored environments) and designs a Structured Transformer Planner that elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.

Iterative Vision-and-Language Navigation

It is found that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.

Find a Way Forward: a Language-Guided Semantic Map Navigator

This paper introduces the map-language navigation task, where an agent executes natural language instructions and moves to the target position based only on a given 3D semantic map, and designs the instruction-aware Path Proposal and Discrimination model (iPPD), which naturally avoids the error accumulation of single-step greedy decision methods.

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Through extensive experiments, it is shown that agents navigating continuous environments with predicted waypoints perform significantly better than agents using low-level actions, reducing the absolute discrete-to-continuous gap by 11.76%.
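
Executing a predicted waypoint still requires low-level control. Below is a minimal sketch of that expansion, assuming the common VLN-CE action set (15° turns, 0.25 m forward steps) and a relative waypoint in polar form; the sign convention and helper name are assumptions.

```python
import math

def waypoint_to_actions(rel_heading, rel_distance,
                        turn_deg=15.0, step_m=0.25):
    """Expand a relative waypoint (radians, meters) into low-level
    VLN-CE-style actions. Step sizes are the common VLN-CE defaults;
    positive heading meaning "left" is an assumed convention."""
    actions = []
    turns = round(math.degrees(rel_heading) / turn_deg)
    actions += ["TURN_LEFT" if turns > 0 else "TURN_RIGHT"] * abs(turns)
    actions += ["MOVE_FORWARD"] * round(rel_distance / step_m)
    return actions

# e.g. a waypoint 45 degrees left and 1 m away:
print(waypoint_to_actions(math.radians(45), 1.0))
# ['TURN_LEFT', 'TURN_LEFT', 'TURN_LEFT', 'MOVE_FORWARD', ...]
```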

Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

This work proposes a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
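
The sub-instruction metric can be pictured as counting, in order, how many language-aligned waypoints the agent reached. The in-order, distance-threshold definition below is an illustrative assumption, not necessarily the paper's exact metric.

```python
import numpy as np

def subinstructions_completed(agent_path, subgoal_points, radius=3.0):
    """Count sub-instructions whose aligned waypoint the agent came
    within `radius` meters of, in order. Both the threshold and the
    in-order requirement are illustrative assumptions."""
    done = 0
    for goal in subgoal_points:
        dists = np.linalg.norm(np.asarray(agent_path) - goal, axis=1)
        if dists.min() <= radius:
            done += 1
        else:
            break  # later sub-instructions presuppose earlier ones
    return done
```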

ASSISTER: Assistive Navigation via Conditional Instruction Generation

This work introduces a novel vision-and-language navigation task of learning to provide real-time guidance to a blind follower situated in complex dynamic navigation scenarios and presents ASSISTER, an imitation-learned agent that can embody such effective guidance.

Cross-modal Map Learning for Vision and Language Navigation

A cross-modal map learning model for vision-and-language navigation is proposed that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints.

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

A multi-granularity map containing both object-grained details and semantic classes is proposed to represent objects more comprehensively, along with a weakly-supervised auxiliary task that requires the agent to localize instruction-relevant objects on the map.

References

Showing 1–10 of 40 references.

Sim-to-Real Transfer for Vision-and-Language Navigation

To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

A language-guided navigation task set in a continuous 3D environment, where agents must execute low-level actions to follow natural language navigation directions, is developed, suggesting that performance in prior 'navigation-graph' settings may be inflated by their strong implicit assumptions.

Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation

This work lifts the agent off the navigation graph and proposes a more complex VLN setting in continuous 3D reconstructed environments and shows that by using layered decision making, modularized training, and decoupling reasoning and imitation, the proposed Hierarchical Cross-Modal agent outperforms existing baselines in all key metrics and sets a new benchmark for Robo-VLN.

Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction

This model predicts interpretable position-visitation distributions indicating where the agent should go during execution and where it should stop, and uses the predicted distributions to select the actions to execute, allowing for simple and efficient training with a combination of supervised learning and imitation learning.
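
A rough sketch of how predicted visitation distributions can drive control: head toward the most probable cell of the visitation map, and stop once the stop distribution concentrates at the agent's position. This greedy rule is an illustrative stand-in for the paper's actual action-selection procedure.

```python
import numpy as np

def select_action(visit_dist, stop_dist, agent_rc, stop_thresh=0.5):
    """Greedy action selection from predicted position-visitation maps.

    visit_dist, stop_dist: 2D arrays over map cells, each summing to 1.
    agent_rc: the agent's (row, col) cell. The argmax rule and the
    stop threshold are illustrative assumptions."""
    if stop_dist[agent_rc] >= stop_thresh:
        return "STOP"
    target = np.unravel_index(np.argmax(visit_dist), visit_dist.shape)
    dy, dx = target[0] - agent_rc[0], target[1] - agent_rc[1]
    heading = np.arctan2(dy, dx)  # steer toward the argmax cell
    return ("FORWARD", heading)
```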

Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning

We introduce a method for following high-level navigation instructions by mapping directly from images, instructions, and pose estimates to continuous low-level velocity commands for real-time control.

Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation

The Habitat-PyRobot Bridge, a library for seamless execution of identical code on a simulated agent and a physical robot, is developed, and a new metric called the Sim-vs-Real Correlation Coefficient (SRCC) is presented to quantify sim2real predictivity; low predictivity is found to stem largely from AI agents learning to 'cheat' by exploiting simulator imperfections.
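
At its core, SRCC correlates paired sim and real performance numbers across a population of agents. A minimal sketch, assuming Pearson's r as the statistic:

```python
import numpy as np

def srcc(sim_scores, real_scores):
    """Sim-vs-Real Correlation Coefficient: correlation between an
    agent population's performance in simulation and on the physical
    robot. Pearson's r is assumed here; the paper's exact statistic
    may differ in detail."""
    return np.corrcoef(sim_scores, real_scores)[0, 1]

# e.g. success rates of several agent checkpoints, sim vs. real;
# a value near 1.0 means sim results predict real-world results.
print(srcc([0.9, 0.7, 0.5, 0.3], [0.6, 0.5, 0.4, 0.2]))
```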

Combining Optimal Control and Learning for Visual Navigation in Novel Environments

This work couples model-based control with learning-based perception to produce a series of waypoints that guide the robot to the goal via a collision-free path, and demonstrates that the proposed approach reaches goal locations more reliably and efficiently in novel environments than purely geometric mapping-based or end-to-end learning-based alternatives.

Object Goal Navigation using Goal-Oriented Semantic Exploration

A modular system called 'Goal-Oriented Semantic Exploration' is presented, which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category, outperforming a wide range of baselines including end-to-end learning-based methods as well as modular map-based methods.

ReLMoGen: Integrating Motion Generation in Reinforcement Learning for Mobile Manipulation

It is argued that, by lifting the action space and leveraging sampling-based motion planners, RL can be used efficiently to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space.
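
The action-space lifting idea can be sketched as a thin environment wrapper: each RL action is a subgoal that a sampling-based motion planner expands into low-level commands before the policy acts again. The env and planner interfaces below are hypothetical, not ReLMoGen's actual API.

```python
class SubgoalActionWrapper:
    """Lift an environment's action space from motor commands to
    subgoals: each RL action is a target pose, expanded by a
    sampling-based motion planner (e.g. RRT) into low-level steps.
    The `env` and `planner` interfaces are assumptions."""

    def __init__(self, env, planner):
        self.env, self.planner = env, planner

    def step(self, subgoal):
        path = self.planner.plan(self.env.agent_pose(), subgoal)
        assert path, "planner assumed to return at least one command"
        total_reward, done = 0.0, False
        for motor_cmd in path:  # execute the planned low-level steps
            obs, reward, done, info = self.env.step(motor_cmd)
            total_reward += reward
            if done:
                break
        # One policy step spans the whole planned trajectory.
        return obs, total_reward, done, {}
```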

Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation

This work presents Success weighted by Completion Time (SCT), a new metric for evaluating navigation performance for mobile robots, and presents RRT*-Unicycle, an algorithm for unicycle dynamics that estimates the fastest collision-free path and completion time from a starting pose to a goal location in an environment containing obstacles.
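
For reference, SCT mirrors SPL but in the time domain. A minimal sketch, assuming T is the agent's completion time and T* the fastest dynamically-feasible time estimated by RRT*-Unicycle:

```python
def sct(success, agent_time, fastest_time):
    """Success weighted by Completion Time:
        SCT = S * T* / max(T, T*),
    where T* is the fastest collision-free completion time under the
    robot's dynamics (estimated with RRT*-Unicycle) and T the agent's
    actual time. The structure mirrors SPL; consult the paper for
    exact details."""
    return float(success) * fastest_time / max(agent_time, fastest_time)

# An agent that succeeds but takes twice the fastest time scores 0.5:
print(sct(True, agent_time=20.0, fastest_time=10.0))  # 0.5
```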