Robust Navigation with Language Pretraining and Stochastic Sampling

@inproceedings{Li2019RobustNW,
  title={Robust Navigation with Language Pretraining and Stochastic Sampling},
  author={Xiujun Li and Chunyuan Li and Qiaolin Xia and Yonatan Bisk and Asli Çelikyilmaz and Jianfeng Gao and Noah A. Smith and Yejin Choi},
  booktitle={EMNLP},
  year={2019}
}
Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes that generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods to address these challenges, leading to new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions…
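The two ideas summarized in the abstract, instruction encoding with a pretrained language model and stochastic action sampling during decoding, can be illustrated with a minimal sketch. This is not the authors' implementation: the checkpoint name, the toy linear policy head, and the candidate-action count are illustrative assumptions, and a real VLN agent would also fuse panoramic visual features before scoring actions.

```python
# Minimal sketch (not the paper's code): encode an instruction with a
# pretrained LM and sample the next action stochastically from the policy
# distribution instead of always taking the argmax / teacher-forced action.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed backbone
encoder = BertModel.from_pretrained("bert-base-uncased")

instruction = "Walk past the sofa and stop at the bedroom door."
tokens = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    text_feats = encoder(**tokens).last_hidden_state             # (1, seq_len, 768)

num_candidate_actions = 6                                        # toy action space
policy_head = torch.nn.Linear(text_feats.size(-1), num_candidate_actions)
logits = policy_head(text_feats.mean(dim=1))                     # (1, num_actions)

# Stochastic sampling: draw the action from the softmax distribution so
# training explores trajectories beyond the single ground-truth path.
probs = torch.softmax(logits, dim=-1)
action = torch.multinomial(probs, num_samples=1)
print("sampled action index:", action.item())
```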
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training
TLDR: This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and …
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
TLDR: A cross-modal grounding module, composed of two complementary attention mechanisms, is designed to equip the agent with a better ability to track the correspondence between the textual and visual modalities, and the advantages of both learning schemes are further exploited via adversarial learning.
Vision-and-Language Navigation with Bayes’ Rule
Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have …
Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
TLDR: A generative language-grounded policy is designed and investigated, which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and the transition history.
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation
TLDR: “Perceive, Transform, and Act” (PTA) is devised: a fully attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities (natural language, images, and low-level actions) for agent control.
Vision-Language Navigation with Random Environmental Mixup
TLDR: The Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments, is proposed and is the best existing approach on the standard VLN benchmark.
Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
TLDR: This work describes an effective algorithm to generate counterfactual observations on the fly for VLN, as linear combinations of existing environments, and shows that this technique provides significant improvements in generalisation on benchmarks for Room-to-Room navigation and Embodied Question Answering.
Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks
TLDR: AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire semantic representations in order to reason about its activities and build a thorough perception of its environments.
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
TLDR: The size, scope, and detail of Room-Across-Room (RxR) dramatically expand the frontier for research on embodied language agents in simulated, photo-realistic environments.

References

Showing 1-10 of 22 references
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
TLDR: This paper presents a generalizable navigational agent, trained in two stages via mixed imitation and reinforcement learning, that outperforms state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task and achieves the top rank on the leaderboard.
Speaker-Follower Models for Vision-and-Language Navigation
TLDR: Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding that achieves state-of-the-art results on the 2018 Room-to-Room (R2R) …
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
TLDR: This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, along with two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods that use greedy action selection.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
TLDR: A novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL) is introduced, along with a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions.
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
TLDR: A novel planned-ahead hybrid reinforcement learning model combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task, significantly outperforming the baselines and achieving the best results on the real-world Room-to-Room dataset.
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
TLDR: A self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
TLDR: This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (see the fine-tuning sketch after this reference list).
Learning models for following natural language directions in unknown environments
TLDR: A novel learning framework is proposed that enables robots to successfully follow natural language route directions without any previous knowledge of the environment, by learning and performing inference over a latent environment model.
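As a loose illustration of the fine-tuning recipe summarized in the BERT reference above, the sketch below adds a single classification head on top of a pretrained encoder. The checkpoint name, the two-label task, and the toy input are illustrative assumptions, not anything specified by the paper.

```python
# Minimal sketch: fine-tune a pretrained BERT encoder with one additional
# output (classification) layer. Checkpoint, label count, and input are toy.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["turn left at the stairs"], return_tensors="pt", padding=True)
labels = torch.tensor([1])                      # toy label for the single example

outputs = model(**batch, labels=labels)         # returns both loss and logits
outputs.loss.backward()                         # one fine-tuning step (optimizer omitted)
print("logits:", outputs.logits.detach())
```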