Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In vision-and-language navigation (VLN), an embodied agent must navigate realistic 3D environments by following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. VLN data is typically collected manually, which is expensive and prevents scalability. This work addresses the data scarcity issue by proposing to automatically…


This work examines CLIP’s capability to make sequential navigational decisions without any dataset-specific finetuning, studies how it influences the path an agent takes, and demonstrates the navigational capability of CLIP.

RREx-BoT: Remote Referring Expressions with a Bag of Tricks

This analysis outlines a “bag of tricks” essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.






Airbert: In-domain Pretraining for Vision-and-Language Navigation

This work introduces BnB, a large-scale and diverse in-domain VLN dataset, which is used to pretrain the Airbert model; Airbert can be adapted to discriminative and generative settings and outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks.

Envedit: Environment Editing for Vision-and-Language Navigation

This work proposes Envedit, a data augmentation method that creates new environments by editing existing ones, which are then used to train a more generalizable agent. Ensembling VLN agents augmented on different edited environments shows that these edit methods are complementary.

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.

Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation

This work lifts the agent off the navigation graph and proposes a more complex VLN setting in continuous 3D reconstructed environments and shows that by using layered decision making, modularized training, and decoupling reasoning and imitation, the proposed Hierarchical Cross-Modal agent outperforms existing baselines in all key metrics and sets a new benchmark for Robo-VLN.

Vision-Language Navigation with Random Environmental Mixup

The experimental results on benchmark datasets demonstrate that data augmented via REM helps the agent reduce the performance gap between seen and unseen environments and improves overall performance, making the model the best existing approach on the standard VLN benchmark.
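As a rough illustration of environment mixup (a minimal sketch, not the authors' exact algorithm; the function name and splice rule are assumptions), one can think of synthesizing an augmented trajectory by splicing paths drawn from two different scenes:

```python
import random

def mixup_paths(path_a, path_b, seed=0):
    """Hedged sketch: splice two navigation paths (lists of viewpoint ids)
    at random cut points to synthesize an augmented trajectory.
    Illustrative only; REM's actual cross-connection rule is more involved."""
    rng = random.Random(seed)
    if len(path_a) < 2 or len(path_b) < 2:
        return list(path_a)
    cut_a = rng.randrange(1, len(path_a))  # keep at least the first node of A
    cut_b = rng.randrange(1, len(path_b))  # keep at least the last node of B
    return path_a[:cut_a] + path_b[cut_b:]

# Example: mix a path from scene A with a path from scene B.
mixed = mixup_paths(["a1", "a2", "a3"], ["b1", "b2", "b3"], seed=1)
```

The mixed path always starts in the first environment and ends in the second, which is the sense in which such augmentation exposes the agent to "new" environment combinations.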

Sim-to-Real Transfer for Vision-and-Language Navigation

To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.
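The discrete-to-continuous bridge can be illustrated with a minimal sketch (2D coordinates and all names here are hypothetical assumptions, not the paper's implementation): a high-level decision is grounded by selecting the candidate waypoint closest to the intended subgoal.

```python
import math

def nearest_waypoint(subgoal, candidate_waypoints):
    """Hedged sketch: map a high-level discrete decision to a low-level
    target by picking the proposed waypoint closest to the intended subgoal.
    Coordinates and names are illustrative, not the paper's API."""
    return min(candidate_waypoints, key=lambda w: math.dist(subgoal, w))

# Example: the agent intends to move toward (1.0, 0.0); a subgoal model
# has proposed three reachable waypoints around the robot.
target = nearest_waypoint((1.0, 0.0), [(0.9, 0.1), (-1.0, 0.0), (0.0, 1.0)])
```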

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

This paper presents a generalizable navigational agent, trained in two stages via mixed imitation and reinforcement learning, that outperforms state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, achieving the top rank on the leaderboard.
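A minimal sketch of the environmental-dropout idea (zeroing the same feature dimensions across every view of an environment so it behaves like a "new" environment for back translation; the function and parameter names are assumptions, not the paper's code):

```python
import random

def environmental_dropout(view_features, drop_prob=0.5, seed=0):
    """Hedged sketch: apply one shared dropout mask to the visual features
    of every view in an environment, mimicking an unseen environment.
    Illustrative only; names and shapes are assumptions."""
    rng = random.Random(seed)
    dim = len(next(iter(view_features.values())))
    # One mask per environment, shared by all views, keeps views consistent.
    mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in range(dim)]
    return {view: [f * m for f, m in zip(feats, mask)]
            for view, feats in view_features.items()}

# Example: two views of the same environment get identical dimensions dropped.
feats = {"v1": [1.0, 2.0, 3.0, 4.0], "v2": [5.0, 6.0, 7.0, 8.0]}
dropped = environmental_dropout(feats, drop_prob=0.5, seed=0)
```

Sharing one mask per environment, rather than sampling per view, is the key design point: it perturbs the environment consistently instead of adding independent noise.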

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.

Structured Scene Memory for Vision-Language Navigation

This work proposes an architecture called Structured Scene Memory (SSM), which is compartmentalized enough to accurately memorize percepts during navigation and serves as a structured scene representation that captures and disentangles visual and geometric cues in the environment.

Active Visual Information Gathering for Vision-Language Navigation

This work proposes an end-to-end framework for learning an exploration policy that decides i) when and where to explore, ii) what information is worth gathering during exploration, and iii) how to adjust the navigation decision after the exploration.