Corpus ID: 237303790

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
This paper presents a novel approach to the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task because they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. In this regard, we present a hybrid…

Related Papers


Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks
AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from semantic information, helping the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of its environment.
Structured Scene Memory for Vision-Language Navigation
This work proposes Structured Scene Memory (SSM), an architecture compartmentalized enough to accurately memorize percepts during navigation, serving as a structured scene representation that captures and disentangles visual and geometric cues in the environment.
Multi-modal Discriminative Model for Vision-and-Language Navigation
This study develops a discriminator that evaluates how well an instruction explains a given path in the VLN task using multi-modal alignment, and reveals that only a small fraction of the high-quality augmented data from Fried et al., as scored by the discriminator, is useful for training VLN agents to similar performance.
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
This work proposes a method to encode vital scene semantics, such as traversable paths, unexplored areas, and observed scene objects, alongside raw visual streams such as RGB, depth, and semantic segmentation masks, into a semantically informed, top-down egocentric map representation, and introduces a novel 2-D map attention mechanism.
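To make the idea of a top-down egocentric semantic map concrete, here is a minimal sketch of one common way such a representation can be built: projecting per-pixel semantic labels, using depth and a pinhole-camera model, into an agent-centered grid. This is not the actual MaAST implementation; the function name, the flat-ground simplification, and all parameters are illustrative assumptions.

```python
import numpy as np

def topdown_semantic_map(depth, semantics, hfov_deg=90.0,
                         map_size=64, cell_m=0.25):
    """Project per-pixel semantic labels into a top-down egocentric grid.

    Illustrative sketch only (not the paper's method). Assumes a pinhole
    camera and flat ground, so forward distance is taken directly from depth.

    depth:     (H, W) metric depth in meters
    semantics: (H, W) integer class labels (0 reserved for "unseen")
    Returns a (map_size, map_size) int grid; the agent sits at the bottom
    row's center, facing "up" the grid.
    """
    h, w = depth.shape
    # focal length in pixels from the horizontal field of view
    f = (w / 2.0) / np.tan(np.radians(hfov_deg) / 2.0)
    us = np.arange(w) - w / 2.0          # pixel offsets from image center
    x = depth * us[None, :] / f          # lateral offset (m), right positive
    z = depth                            # forward distance (m)
    # bin metric coordinates into grid cells
    col = (x / cell_m + map_size / 2.0).astype(int)
    row = (map_size - 1 - z / cell_m).astype(int)
    grid = np.zeros((map_size, map_size), dtype=np.int64)
    valid = (col >= 0) & (col < map_size) & \
            (row >= 0) & (row < map_size) & (depth > 0)
    # write farthest points first so nearer observations overwrite them
    order = np.argsort(-depth[valid])
    grid[row[valid][order], col[valid][order]] = semantics[valid][order]
    return grid
```

Such a grid (often stacked with occupancy and "explored" channels) is what a 2-D map attention mechanism would then attend over.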
Cognitive Mapping and Planning for Visual Navigation
The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and a spatial memory with the ability to plan given an incomplete set of observations about the world.
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training
This paper presents the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks, which leads to significant improvement over existing methods, achieving a new state of the art.
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The Evolving Graphical Planner (EGP) is introduced, a model that performs global planning for navigation based on raw sensory input; it dynamically constructs a graphical representation, generalizes the action space to allow for more flexible decision making, and performs efficient planning on a proxy graph representation.
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
A language-guided navigation task set in a continuous 3D environment, where agents must execute low-level actions to follow natural language navigation directions, is developed, suggesting that performance in prior 'navigation-graph' settings may be inflated by strong implicit assumptions.
Topological Planning with Transformers for Vision-and-Language Navigation
This work proposes a modular approach to VLN using topological maps that leverages attention mechanisms to predict a navigation plan in the map; it generates interpretable navigation plans and exhibits intelligent behaviors such as backtracking.
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.