1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)

Dongyan An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang-Hsun Wang, Jing Shao
This report presents the methods of the winning entry of the RxR-Habitat Competition at CVPR 2022. The competition addresses the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE), which requires an agent to follow step-by-step natural language instructions to reach a target. We present a modular plan-and-control approach for the task. Our model consists of three modules: the candidate waypoints predictor (CWP), the history-enhanced planner, and the tryout controller…
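
As a rough illustration of the plan-and-control split described above, the following Python sketch picks one candidate waypoint and converts it into low-level actions. Everything in it is an assumption made for illustration: the `Waypoint` fields, the `plan` and `control` helpers, the action names, and the 30-degree turn and 0.25 m step sizes are hypothetical. The controller is a naive conversion and does not reproduce the report's tryout controller, whose details are not given here.

```python
from dataclasses import dataclass
from typing import List

# Assumed low-level action granularity (hypothetical values for this sketch).
TURN_DEG = 30.0
STEP_M = 0.25


@dataclass
class Waypoint:
    heading_deg: float  # relative to the agent's heading; positive means left
    distance_m: float   # straight-line distance to the waypoint
    score: float        # planner preference (stand-in for a learned score)


def plan(candidates: List[Waypoint]) -> Waypoint:
    """Planner stand-in: choose the highest-scoring candidate waypoint."""
    return max(candidates, key=lambda w: w.score)


def control(wp: Waypoint) -> List[str]:
    """Naive controller: rotate toward the waypoint, then step forward."""
    # Wrap the heading into [-180, 180) so the agent takes the shorter turn.
    heading = (wp.heading_deg + 180.0) % 360.0 - 180.0
    n_turns = round(abs(heading) / TURN_DEG)
    turn = "TURN_LEFT" if heading > 0 else "TURN_RIGHT"
    n_steps = round(wp.distance_m / STEP_M)
    return [turn] * n_turns + ["MOVE_FORWARD"] * n_steps


# Candidates would come from a waypoint predictor; here they are hard-coded.
candidates = [Waypoint(60.0, 0.75, 0.2), Waypoint(-90.0, 1.0, 0.7)]
actions = control(plan(candidates))
```

Keeping the planner and controller behind separate functions mirrors the modular structure: either piece can be swapped without touching the other.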

Waypoint Models for Instruction-guided Navigation in Continuous Environments

A class of language-conditioned waypoint prediction networks is developed to examine the role of action spaces in language-guided visual navigation. More expressive models result in simpler, faster-to-execute trajectories, but lower-level actions can achieve better navigation metrics by approximating shortest paths more closely.

History Aware Multimodal Transformer for Vision-and-Language Navigation

A History Aware Multimodal Transformer (HAMT) is introduced to incorporate a long-horizon history into multimodal decision making for vision-and-language navigation and achieves new state of the art on a broad range of VLN tasks.

Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

This work explores the gap in performance between the standard VLN setting built on topological environments, where navigation is abstracted away, and the VLN-CE setting, where agents must navigate continuous 3D environments using low-level actions, and demonstrates the potential for this direction.

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

A language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions is developed, suggesting that performance in prior "navigation-graph" settings may be inflated by the strong implicit assumptions.

Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method

This paper proposes a snapshot-based ensemble solution that leverages predictions from multiple snapshots of the existing state-of-the-art (SOTA) model Recurrent VLN-BERT (VLN⟳BERT) and a past-action-aware modification, achieving new SOTA performance on the R2R dataset challenge.
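
One common way to combine predictions from several snapshots is to average their per-action probability distributions and take the most likely action. The Python sketch below shows that idea only; it is an assumption about the general technique, not a description of the paper's exact ensembling rule, and the function names `softmax`, `ensemble_probs`, and `ensemble_action` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one snapshot's action logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(snapshot_logits: List[Sequence[float]]) -> List[float]:
    """Average the action distributions predicted by each snapshot."""
    probs = [softmax(logits) for logits in snapshot_logits]
    n_snapshots = len(probs)
    n_actions = len(probs[0])
    return [sum(p[i] for p in probs) / n_snapshots for i in range(n_actions)]


def ensemble_action(snapshot_logits: List[Sequence[float]]) -> int:
    """Pick the action index with the highest averaged probability."""
    avg = ensemble_probs(snapshot_logits)
    return max(range(len(avg)), key=avg.__getitem__)


# Three snapshots scoring the same three candidate actions (made-up numbers).
logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.0, 2.5, 1.0]]
chosen = ensemble_action(logits)
```

Averaging probabilities rather than raw logits keeps each snapshot's vote on the same scale, so a single snapshot with large logit magnitudes cannot dominate by scale alone.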

Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

This paper introduces a human-annotated fine-grained VLN dataset, namely Landmark-RxR, and proposes a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories.

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

To bridge the discrete-to-continuous gap, a predictor is proposed to generate a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments.

EnvEdit: Environment Editing for Vision-and-Language Navigation

This work proposes ENVEDIT, a data augmentation method that creates new environments by editing existing ones; the edited environments are used to train a more generalizable agent. It also ensembles VLN agents augmented on different edited environments and shows that these editing methods are complementary.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation

This work proposes an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level, namely objects and words, and enables the visual-textual clues to be interpreted in light of the temporal context, which is crucial to multi-round VLN tasks.