Robust Navigation with Language Pretraining and Stochastic Sampling

Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah A. Smith, Yejin Choi
Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes that generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods that address these challenges and lead to new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions…


Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

In this work, BnB is introduced, a large-scale and diverse in-domain VLN dataset that is used to pretrain the Airbert model, which can be adapted to discriminative and generative settings and outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks.

Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

A cross-modal grounding module, composed of two complementary attention mechanisms, is designed to equip the agent with a better ability to track the correspondence between the textual and visual modalities, and adversarial learning is used to further exploit the advantages of both learning schemes.

Anticipating the Unseen Discrepancy for Vision and Language Navigation

A semi-supervised framework, DAVIS, that leverages visual consistency signals across similar semantic observations and enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations, across a joint training stage and a test-time adaptation stage.

Curriculum Learning for Vision-and-Language Navigation

A novel curriculum-based training paradigm for VLN tasks that can balance human prior knowledge and agent learning progress about training samples is proposed; the principle of curriculum design is developed, and the benchmark Room-to-Room (R2R) dataset is re-arranged to make it suitable for curriculum training.

Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method

This paper proposes a snapshot-based ensemble solution that leverages predictions from multiple snapshots of the existing state-of-the-art (SOTA) BERT-based model, together with a past-action-aware modification, achieving new SOTA performance on the R2R challenge.
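The core of a snapshot ensemble can be sketched in a few lines: average the action distributions predicted by several saved snapshots of one model, then act greedily on the average. This is a minimal illustration with invented placeholder probabilities, not the paper's implementation.

```python
# Hypothetical sketch of snapshot ensembling for action prediction.
# Each snapshot outputs a probability distribution over candidate actions;
# the ensemble averages them. The numbers below are invented placeholders.

def ensemble_action_probs(snapshot_probs):
    """Average per-action probabilities across model snapshots."""
    n = len(snapshot_probs)
    num_actions = len(snapshot_probs[0])
    return [
        sum(probs[a] for probs in snapshot_probs) / n
        for a in range(num_actions)
    ]

# Three snapshots, each predicting a distribution over three actions.
snapshots = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
avg = ensemble_action_probs(snapshots)
best_action = max(range(len(avg)), key=lambda a: avg[a])
```

Averaging tends to smooth out the idiosyncratic errors of any single snapshot, which is the intuition behind ensembling checkpoints of one training run instead of training several models.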

VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation

A recurrent BERT model that is time-aware for use in VLN is proposed; it can replace more complex encoder-decoder models to achieve state-of-the-art results and can be generalised to other transformer-based architectures.

Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule

A generative language-grounded policy is designed and investigated that uses a language model to compute the distribution over all possible instructions (i.e., all possible sequences of vocabulary tokens) given the action and the transition history.
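The Bayes' rule flip at the heart of this approach can be sketched concretely: instead of modeling p(action | instruction, history) discriminatively, score each candidate action by the likelihood a language model assigns to the instruction under that action, p(instruction | action, history), combined with an action prior. The log-likelihood values below are invented placeholders standing in for real language-model scores.

```python
import math

# Hedged sketch of a generative language-grounded policy:
# p(a | x, h) ∝ p(x | a, h) * p(a | h)  (Bayes' rule),
# where p(x | a, h) is the instruction likelihood from a language model.
# The numeric values are invented placeholders, not real model outputs.

def generative_policy(log_lik, log_prior):
    """Posterior over actions from instruction log-likelihoods and a prior."""
    scores = [ll + lp for ll, lp in zip(log_lik, log_prior)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

# Three candidate actions; the language model finds the instruction most
# probable given the history produced by action 1.
log_lik = [-12.0, -10.5, -13.0]        # log p(instruction | a, history)
log_prior = [math.log(1 / 3)] * 3      # uniform p(a | history)
posterior = generative_policy(log_lik, log_prior)
chosen = max(range(3), key=lambda a: posterior[a])
```

With a uniform prior, the prior term cancels and the agent simply follows the action under which the instruction is most probable, which is what makes the generative formulation attractive when instruction-conditioned action labels are scarce.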

History Aware Multimodal Transformer for Vision-and-Language Navigation

A History Aware Multimodal Transformer (HAMT) is introduced to incorporate a long-horizon history into multimodal decision making for vision-and-language navigation and achieves new state of the art on a broad range of VLN tasks.

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

This paper presents a generalizable navigational agent, trained in two stages via mixed imitation and reinforcement learning, outperforming the state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, and achieving the top rank on the leaderboard.

Speaker-Follower Models for Vision-and-Language Navigation

Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and a panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation

We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the 2018 Room-to-Room (R2R) challenge.
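The idea of frontier-aware search with backtracking can be illustrated with a small best-first search: keep every partially explored node on a global frontier and always expand the highest-scoring one, even when that means jumping back to an earlier junction. This is an illustrative sketch under an invented toy graph and invented scores, not the authors' implementation.

```python
import heapq

# Illustrative frontier-aware search with backtracking (not FAST itself):
# every reachable node stays on one global frontier, scored by how
# promising it looks; expansion always picks the global best, which lets
# the agent abandon a dead-end branch and resume an earlier one.

def frontier_search(graph, scores, start, goal):
    """Best-first search over a navigation graph; returns a path to goal."""
    frontier = [(-scores[start], start, [start])]  # max-heap via negation
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(frontier, (-scores[nxt], nxt, path + [nxt]))
    return None

# Toy graph: the agent first tries the high-scoring branch through B,
# hits a low-scoring dead end, then backtracks to C and reaches the goal.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["G"]}
scores = {"A": 1.0, "B": 0.9, "C": 0.4, "D": 0.2, "E": 0.8, "G": 1.0}
path = frontier_search(graph, scores, "A", "G")
```

The contrast with greedy decoding is that a greedy agent committed to B could never recover; the shared frontier is what makes self-correction possible.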

The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation

This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, and proposes two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods that use greedy action selection.

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

A novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL) is introduced, along with a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions.

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation

A novel planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task; it significantly outperforms the baselines and achieves the best results on the real-world Room-to-Room dataset.

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

A self-monitoring agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Learning models for following natural language directions in unknown environments

A novel learning framework is proposed that enables robots to successfully follow natural language route directions without any previous knowledge of the environment by learning and performing inference over a latent environment model.