Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
@article{Chen2022LearningFU,
  title   = {Learning from Unlabeled 3D Environments for Vision-and-Language Navigation},
  author  = {Shizhe Chen and Pierre-Louis Guhur and Makarand Tapaswi and Cordelia Schmid and Ivan Laptev},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2208.11781}
}
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically…
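The abstract is cut off here, but the stated direction, automatically creating VLN training data from unlabeled environments, typically follows a simple recipe: sample trajectories on an environment's navigation graph, then label them with a speaker (instruction-generation) model. The sketch below illustrates that generic recipe only; the `speaker` object, the graph format, and the hop limits are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of the generic "generate VLN data from unlabeled 3D scenes"
# recipe: sample trajectories on a navigation graph, then label them with a
# speaker model. The speaker interface and hop limits are assumptions.
import random
import networkx as nx

def sample_trajectory(graph: nx.Graph, min_hops: int = 4, max_hops: int = 7):
    """Sample a shortest path between two random viewpoints (graph assumed connected)."""
    start, goal = random.sample(list(graph.nodes), 2)
    path = nx.shortest_path(graph, start, goal)
    return path if min_hops <= len(path) - 1 <= max_hops else None

def build_dataset(graph: nx.Graph, speaker, num_samples: int) -> list[dict]:
    samples = []
    while len(samples) < num_samples:
        path = sample_trajectory(graph)
        if path is None:
            continue  # resample until the trajectory length is acceptable
        # A hypothetical speaker model turns the observations along the
        # path into a natural-language instruction.
        instruction = speaker.generate(path)
        samples.append({"path": path, "instruction": instruction})
    return samples
```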
4 Citations
Using CLIP for Zero-Shot Vision-and-Language Navigation
- Computer Science
- 2022
This work examines CLIP’s capability to make sequential navigational decisions without any dataset-specific finetuning, studies how it influences the path that an agent takes, and demonstrates the navigational capability of CLIP.
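As a rough illustration of the zero-shot setup this summary describes, an agent can score each adjacent candidate view against the instruction with CLIP and step toward the best match. The checkpoint name below is a real Hugging Face model; the surrounding agent loop and `candidate_views` are assumptions made for the sketch, not the cited paper's method.

```python
# Hedged sketch of zero-shot navigation with CLIP: rank candidate views by
# image-text similarity to the instruction and pick the best one.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_next_view(instruction: str, candidate_views: list[Image.Image]) -> int:
    """Return the index of the candidate view most similar to the instruction."""
    inputs = processor(text=[instruction], images=candidate_views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_images): the similarity of the
    # instruction to each candidate view.
    return out.logits_per_text.argmax(dim=-1).item()
```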
RREx-BoT: Remote Referring Expressions with a Bag of Tricks
- Computer Science
- 2023
This analysis outlines a “bag of tricks” essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.
Curriculum Vitae
- Medicine, Analysis as a Tool in Mathematical Physics
- 2020
This paper presents meta-modelling and visual recognition of human actions and interactions, using spatio-temporal image features for motion interpretation.
References
Showing 1-10 of 57 references
Airbert: In-domain Pretraining for Vision-and-Language Navigation
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work introduces BnB, a large-scale and diverse in-domain VLN dataset, which is used to pretrain the Airbert model; Airbert can be adapted to discriminative and generative settings and outperforms the state of the art on the Room-to-Room (R2R) and Remote Referring Expression (REVERIE) benchmarks.
Envedit: Environment Editing for Vision-and-Language Navigation
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This work proposes Envedit, a data augmentation method that creates new environments by editing existing ones and uses them to train a more generalizable agent; ensembling VLN agents augmented on different edited environments shows that the editing methods are complementary.
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.
Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
- Computer Science, 2021 IEEE International Conference on Robotics and Automation (ICRA)
- 2021
This work lifts the agent off the navigation graph and proposes a more complex VLN setting in continuous 3D reconstructed environments; by using layered decision making, modularized training, and decoupled reasoning and imitation, the proposed hierarchical cross-modal agent outperforms existing baselines on all key metrics and sets a new benchmark for Robo-VLN.
Vision-Language Navigation with Random Environmental Mixup
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
Experimental results on benchmark datasets demonstrate that the data augmented via REM helps the agent reduce the performance gap between seen and unseen environments and improves overall performance, making the model the best existing approach on the standard VLN benchmark.
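For intuition, the mixup described above can be pictured as splicing two training episodes from different scenes at a cut point, producing a cross-connected path with a correspondingly spliced instruction. The sketch below is a simplified illustration under assumed data structures, not the paper's implementation, which operates on the scenes' connectivity graphs.

```python
# Simplified sketch of environmental mixup: join the first half of one
# episode's path to the second half of another's, and splice the aligned
# sub-instructions. The episode dict layout is an assumption.
def mixup_episodes(episode_a: dict, episode_b: dict) -> dict:
    """Splice the first half of episode A onto the second half of episode B.

    Each episode is assumed to carry a path (list of viewpoints) and two
    sub-instructions pre-aligned to the two halves of that path.
    """
    cut_a = len(episode_a["path"]) // 2  # where A's first sub-instruction ends
    cut_b = len(episode_b["path"]) // 2  # where B's second sub-instruction starts
    return {
        "path": episode_a["path"][:cut_a] + episode_b["path"][cut_b:],
        "instruction": episode_a["sub_instructions"][0] + " "
                       + episode_b["sub_instructions"][1],
    }
```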
Sim-to-Real Transfer for Vision-and-Language Navigation
- Computer Science, CoRL
- 2020
To bridge the gap between the high-level discrete action space learned by the VLN agent and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.
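As a concrete picture of the discrete-to-continuous bridge mentioned above, a subgoal model proposes a nearby waypoint and a local controller drives toward it with continuous commands. The controller below is a generic sketch; the pose format, limits, and thresholds are assumptions, not the paper's values.

```python
# Generic waypoint-following step: turn toward the waypoint, then move
# forward once roughly facing it. Limits and thresholds are assumptions.
import math

def drive_to_waypoint(pose: dict, waypoint: tuple, max_turn=0.26, max_forward=0.25):
    """One low-level control step (turn rate, forward distance) toward a waypoint."""
    dx, dy = waypoint[0] - pose["x"], waypoint[1] - pose["y"]
    heading_error = math.atan2(dy, dx) - pose["heading"]
    # Wrap the error into [-pi, pi] so the robot turns the short way around.
    heading_error = (heading_error + math.pi) % (2 * math.pi) - math.pi
    turn = max(-max_turn, min(max_turn, heading_error))
    # Only move forward once roughly facing the waypoint.
    forward = max_forward if abs(heading_error) < 0.3 else 0.0
    return turn, forward
```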
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
- Computer Science, NAACL
- 2019
This paper presents a generalizable navigational agent, trained in two stages via mixed imitation and reinforcement learning, that outperforms state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task and achieves the top rank on the leaderboard.
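The two ingredients named in the title can be sketched briefly: environmental dropout masks the same feature channels for every view in an environment, so the agent effectively sees a new environment, and back translation uses a speaker model to label unannotated paths. The tensor layout and the `speaker` object below are assumptions for illustration.

```python
# Sketch of environmental dropout and back translation. The feature layout
# (num_views, feat_dim) and the speaker interface are assumptions.
import torch

def environmental_dropout(features: torch.Tensor, p: float = 0.4) -> torch.Tensor:
    """Drop the same feature channels for every view of one environment.

    features: (num_views, feat_dim) visual features of a single environment.
    """
    mask = (torch.rand(features.shape[-1]) > p).float()  # one mask per channel
    return features * mask  # broadcasts the shared mask over all views

def back_translate(paths, speaker):
    """Generate pseudo-instructions for unlabeled paths with a speaker model."""
    return [(path, speaker.generate(path)) for path in paths]
```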
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
- Computer Science, EMNLP
- 2020
The size, scope and detail of Room-Across-Room (RxR) dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.
Structured Scene Memory for Vision-Language Navigation
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes Structured Scene Memory (SSM), an architecture compartmentalized enough to accurately memorize percepts during navigation; it serves as a structured scene representation that captures and disentangles visual and geometric cues in the environment.
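For intuition, a structured scene memory of this kind can be pictured as a persistent graph over visited viewpoints that accumulates visual features and geometry during navigation. The class below is a simplified illustration under assumed interfaces, not the paper's SSM.

```python
# Toy scene memory: a graph keyed by viewpoint that stores visual features
# and positions, linking consecutively visited viewpoints.
import networkx as nx

class SceneMemory:
    def __init__(self):
        self.graph = nx.Graph()
        self.current = None

    def observe(self, viewpoint_id, visual_feat, position):
        """Add or refresh the current viewpoint and link it to the previous one."""
        self.graph.add_node(viewpoint_id, feat=visual_feat, pos=position)
        if self.current is not None:
            # The edge records adjacency between consecutive viewpoints.
            self.graph.add_edge(self.current, viewpoint_id)
        self.current = viewpoint_id
```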
Active Visual Information Gathering for Vision-Language Navigation
- Computer Science, ECCV
- 2020
This work proposes an end-to-end framework for learning an exploration policy that decides i) when and where to explore, ii) what information is worth gathering during exploration, and iii) how to adjust the navigation decision after the exploration.
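The three decisions enumerated above suggest a simple control loop: decide whether to explore, gather information, update the agent's belief, then act. The sketch below is only a schematic of that loop; the `agent`, `explore_policy`, and `env` interfaces are hypothetical.

```python
# Schematic exploration-augmented navigation loop over hypothetical
# agent/policy/environment interfaces.
def navigate_with_exploration(agent, explore_policy, env, max_steps=30):
    obs = env.reset()
    for _ in range(max_steps):
        if explore_policy.should_explore(obs):        # i) when/where to explore
            direction = explore_policy.pick_direction(obs)
            gathered = env.peek(direction)            # ii) what to gather
            obs = agent.update_belief(obs, gathered)  # iii) adjust the decision
        action = agent.act(obs)
        obs, done = env.step(action)
        if done:
            break
```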