Corpus ID: 57761103

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

@article{Ma2019SelfMonitoringNA,
  title={Self-Monitoring Navigation Agent via Auxiliary Progress Estimation},
  author={Chih-Yao Ma and Jiasen Lu and Zuxuan Wu and Ghassan Al-Regib and Zsolt Kira and Richard Socher and Caiming Xiong},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.03035}
}
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. In this paper, we introduce a self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the…
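To make the method description above concrete, here is a minimal sketch of how such a progress monitor could be wired, assuming a PyTorch-style LSTM decoder; the layer layout, tensor names, and the training target (regressing normalized distance-to-goal with an MSE loss) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ProgressMonitor(nn.Module):
    # Regress a scalar progress estimate from the decoder state and the
    # attention distribution over instruction tokens (illustrative sketch).
    def __init__(self, hidden_dim: int, instr_len: int):
        super().__init__()
        self.h_proj = nn.Linear(hidden_dim, hidden_dim)
        self.c_proj = nn.Linear(hidden_dim, hidden_dim)
        # The textual attention weights act as a soft signpost for how much
        # of the instruction has been grounded so far.
        self.out = nn.Linear(hidden_dim + instr_len, 1)

    def forward(self, h_t, c_t, text_attn):
        # h_t, c_t: (batch, hidden_dim) decoder LSTM hidden/cell states
        # text_attn: (batch, instr_len) attention weights over instruction tokens
        fused = torch.sigmoid(self.h_proj(h_t)) * torch.tanh(self.c_proj(c_t))
        p = torch.tanh(self.out(torch.cat([fused, text_attn], dim=-1)))
        return p.squeeze(-1)  # scalar in (-1, 1) per example; e.g. trained with
                              # MSE against normalized distance-to-goal
```

In this sketch the regression loss would simply be added to the action-prediction loss, so the same grounded representation must both pick the next direction and explain how far along the instruction the agent is.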

Citations

Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation
This work finds that the shortest-path sampling used by both the R2R benchmark and existing augmentation methods encodes biases in the agent's action space, dubbed action priors, and proposes a path-sampling method based on random walks to augment the data.
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
A cross-modal grounding module composed of two complementary attention mechanisms equips the agent with a better ability to track the correspondence between the textual and visual modalities, and alternate adversarial learning further exploits the advantages of the two learning schemes.
Diagnosing the Environment Bias in Vision-and-Language Navigation
This work designs novel diagnostic experiments via environment re-splitting and feature replacement to investigate possible causes of the environment bias in VLN models, and explores several kinds of semantic representations that contain less low-level visual information.
Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks
AuxRN, a framework with four self-supervised auxiliary reasoning tasks, exploits the additional training signals derived from semantic information, helping the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of its environments.
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
This paper proposes to use the progress monitor developed in prior work as a learnable heuristic for search, together with two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods using greedy action selection.
Vision Language Navigation with Multi-granularity Observation and Auxiliary Reasoning Tasks
Multi-granularity Auxiliary Reasoning Navigation (MG-AuxRN) is a navigation framework that employs four auxiliary reasoning tasks to reason over global image features and detected object features; the authors empirically demonstrate that an agent trained with the self-supervised auxiliary reasoning tasks substantially outperforms previous state-of-the-art methods.
Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
This work proposes a simple and effective language-aligned supervision scheme and a new metric that measures the number of sub-instructions the agent has completed during navigation.
Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding that achieves state-of-the-art results on the 2018 Room-to-Room (R2R) Vision-and-Language navigation challenge.
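As a rough illustration of what frontier-aware decoding with backtracking can look like, below is a hedged best-first-search sketch; score_fn, expand_fn, and should_stop are placeholder interfaces (e.g. a learned progress estimate, the navigable neighbors of a viewpoint, and a STOP prediction), not the FAST authors' implementation.

```python
import heapq
from itertools import count

def frontier_search(start, score_fn, expand_fn, should_stop, max_expansions=40):
    # Best-first decoding over a frontier of partial paths: always expand the
    # globally best-scoring partial path, which may require physically
    # backtracking to an earlier viewpoint.
    tie = count()  # tie-breaker so the heap never compares path lists directly
    frontier = [(-score_fn([start]), next(tie), [start])]
    visited = set()
    best_path = [start]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, best_path = heapq.heappop(frontier)
        node = best_path[-1]
        if should_stop(best_path):
            return best_path  # the agent predicts STOP on this path
        if node in visited:
            continue
        visited.add(node)
        for nxt in expand_fn(node):
            new_path = best_path + [nxt]
            heapq.heappush(frontier, (-score_fn(new_path), next(tie), new_path))
    return best_path  # expansion budget exhausted: best partial path seen
```

Greedy decoding is the special case where the frontier only ever holds the children of the current node, so a single early mistake can never be undone; keeping the whole frontier is what enables the self-correction in the title.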
Rethinking the Spatial Route Prior in Vision-and-Language Navigation
This work addresses the VLN task from a previously ignored aspect, namely the spatial route prior of the navigation scenes, and proposes a sequential-decision variant and an explore-and-exploit scheme that curates a compact and informative sub-graph to exploit.
Vision-Language Navigation Policy Learning and Adaptation
A novel Reinforced Cross-Modal Matching (RCM) approach enforces cross-modal grounding both locally and globally via reinforcement learning (RL), and a Self-Supervised Imitation Learning (SIL) method explores and adapts to unseen environments by imitating the agent's own past good decisions.

References

Showing 1-10 of 57 references
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
This paper proposes to use the progress monitor developed in prior work as a learnable heuristic for search, together with two modules incorporated into an end-to-end architecture that significantly outperforms current state-of-the-art methods using greedy action selection.
Target-driven visual navigation in indoor scenes using deep reinforcement learning
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, allowing better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
A novel Reinforced Cross-Modal Matching (RCM) approach enforces cross-modal grounding both locally and globally via reinforcement learning (RL), and a Self-Supervised Imitation Learning (SIL) method explores unseen environments by imitating the agent's own past good decisions.
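A minimal sketch of the SIL loop this summary describes, under assumed interfaces (agent.rollout, agent.imitate, and critic.match_score are hypothetical names, not from the paper): sample trajectories in an unseen environment, keep the best-scoring one per instruction, then behavior-clone it.

```python
def self_supervised_imitation(agent, critic, instructions, rollouts_per_instr=8):
    # One SIL round (illustrative): the agent imitates its own best past decisions.
    replay = {}  # instruction -> (score, trajectory)
    for instr in instructions:
        for _ in range(rollouts_per_instr):
            traj = agent.rollout(instr, sample=True)  # stochastic exploration
            # Score how well the trajectory matches the instruction,
            # e.g. via a learned cross-modal matching critic.
            score = critic.match_score(instr, traj)
            if instr not in replay or score > replay[instr][0]:
                replay[instr] = (score, traj)
    for instr, (score, traj) in replay.items():
        agent.imitate(instr, traj)  # supervised update toward the stored trajectory
```

Because no ground-truth paths exist in the unseen environments, the matching critic's score stands in for supervision; the loop only ever reinforces trajectories the agent itself has already produced.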
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.
Visual Representations for Semantic Target Driven Navigation
This work proposes to use semantic segmentation and detection masks, obtained from state-of-the-art computer vision algorithms, as observations, and trains a deep network to learn navigation policies on top of representations that capture spatial layout and semantic contextual cues.
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
This work introduces a multi-level aligner that empowers an alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) to translate natural-language instructions into action sequences based on a representation of the observable world state.
Speaker-Follower Models for Vision-and-Language Navigation
Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and a panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
Learning to Navigate in Cities Without a Map
This work proposes a dual-pathway architecture that allows locale-specific features to be encapsulated while still enabling transfer to multiple cities, and presents an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage.
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
A novel planned-ahead hybrid reinforcement learning model combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task, significantly outperforming the baselines and achieving the best results on the real-world Room-to-Room dataset.
Visual Semantic Planning Using Deep Successor Representations
This work addresses the problem of visual semantic planning, the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.