Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
@article{Ma2019SelfMonitoringNA,
  title={Self-Monitoring Navigation Agent via Auxiliary Progress Estimation},
  author={Chih-Yao Ma and Jiasen Lu and Zuxuan Wu and Ghassan Al-Regib and Zsolt Kira and Richard Socher and Caiming Xiong},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.03035}
}
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. […] Key Method: In this paper, we introduce a self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the…
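The progress monitor described above is trained to regress how far along the instruction the agent is. Its training signal can be sketched as the normalized reduction in distance to the goal; a minimal illustration (the function name and edge-case handling are assumptions, not the paper's exact formulation):

```python
def progress_target(d0, dt):
    """Normalized progress label for step t: the fraction of the
    initial distance-to-goal (d0) the agent has covered so far,
    given its current distance-to-goal (dt)."""
    if d0 <= 0:
        return 1.0  # degenerate case: the agent starts at the goal
    return (d0 - dt) / d0
```

Under this labeling, progress is 0 at the start, rises toward 1 as the agent approaches the goal, and goes negative if the agent moves farther away than where it began.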
142 Citations
Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2020
This work finds that shortest-path sampling, used by both the R2R benchmark and existing augmentation methods, encodes biases in the agent's action space, which the authors dub action priors, and proposes a path-sampling method based on random walks to augment the data.
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
- Computer Science · ArXiv
- 2020
A cross-modal grounding module composed of two complementary attention mechanisms is designed to equip the agent with a better ability to track the correspondence between the textual and visual modalities, and the advantages of both learning schemes are further exploited via adversarial learning.
Diagnosing the Environment Bias in Vision-and-Language Navigation
- Computer Science · IJCAI
- 2020
This work designs novel diagnosis experiments via environment re-splitting and feature replacement to investigate possible reasons for the environment bias in VLN models, and explores several kinds of semantic representations that contain less low-level visual information.
Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of its environment.
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, along with two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods that use greedy action selection.
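The idea of using a learned progress estimate as a search heuristic, rather than greedily committing to one action, can be illustrated with a best-first search over frontier viewpoints. This is a rough sketch under assumed interfaces (the `expand` and `progress_score` callables are hypothetical stand-ins, not the paper's actual modules):

```python
import heapq

def progress_guided_search(start, expand, progress_score, max_steps=100):
    """Best-first search over frontier nodes, always extending the node
    with the highest estimated progress (a max-heap via negated scores)."""
    counter = 0  # tie-breaker so the heap never has to compare nodes directly
    frontier = [(-progress_score(start), counter, start)]
    visited = set()
    while frontier and max_steps > 0:
        _, _, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        if progress_score(node) >= 1.0:  # estimated to have reached the goal
            return node
        for nxt in expand(node):
            counter += 1
            heapq.heappush(frontier, (-progress_score(nxt), counter, nxt))
        max_steps -= 1
    return None  # budget exhausted or frontier emptied without reaching the goal
```

Because the whole frontier stays in the heap, the search can "regret" a poor local choice and back up to an earlier, more promising viewpoint, which is the behavior greedy decoding cannot recover.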
Vision Language Navigation with Multi-granularity Observation and Auxiliary Reasoning Tasks
- Computer Science
- 2020
Multi-granularity Auxiliary Reasoning Navigation (MG-AuxRN) is a navigation framework that employs four auxiliary reasoning tasks to reason over global image features and detected object features; the authors empirically demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms previous state-of-the-art methods.
Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
- Computer Science · EMNLP
- 2021
This work proposes a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the 2018 Room-to-Room (R2R)…
Rethinking the Spatial Route Prior in Vision-and-Language Navigation
- Computer Science · ArXiv
- 2021
This work addresses the task of VLN from a previously ignored aspect, namely the spatial route prior of the navigation scenes, and proposes a sequential-decision variant and an explore-and-exploit scheme that curates a compact and informative sub-graph to exploit.
Vision-Language Navigation Policy Learning and Adaptation
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2021
A novel Reinforced Cross-Modal Matching (RCM) approach is introduced that enforces cross-modal grounding both locally and globally via reinforcement learning (RL), together with a Self-Supervised Imitation Learning (SIL) method that explores and adapts to unseen environments by imitating the agent's own past good decisions.
References
Showing 1-10 of 57 references
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This paper proposes to use a progress monitor developed in prior work as a learnable heuristic for search, along with two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods that use greedy action selection.
Target-driven visual navigation in indoor scenes using deep reinforcement learning
- Computer Science · 2017 IEEE International Conference on Robotics and Automation (ICRA)
- 2017
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, allowing better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A novel Reinforced Cross-Modal Matching (RCM) approach is introduced that enforces cross-modal grounding both locally and globally via reinforcement learning (RL), together with a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
Visual Representations for Semantic Target Driven Navigation
- Computer Science · 2019 International Conference on Robotics and Automation (ICRA)
- 2019
This work proposes to use semantic segmentation and detection masks as observations obtained by state-of-the-art computer vision algorithms and use a deep network to learn navigation policies on top of representations that capture spatial layout and semantic contextual cues.
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
- Computer Science · AAAI
- 2016
This work introduces a multi-level aligner that empowers the alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) to translate natural language instructions to action sequences based upon a representation of the observable world state.
Speaker-Follower Models for Vision-and-Language Navigation
- Computer Science · NeurIPS
- 2018
Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Learning to Navigate in Cities Without a Map
- Computer Science · NeurIPS
- 2018
This work proposes a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities, and presents an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage.
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
- Computer Science · ECCV
- 2018
A novel, planned-ahead hybrid reinforcement learning model is proposed that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task; it significantly outperforms the baselines and achieves the best performance on the real-world Room-to-Room dataset.
Visual Semantic Planning Using Deep Successor Representations
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This work addresses the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.