RUN through the Streets: A New Dataset and Baseline Models for Realistic Urban Navigation

Tzuf Paz-Argaman, Reut Tsarfaty
Following navigation instructions in natural language (NL) requires a composition of language, action, and knowledge of the environment. Knowledge of the environment may be provided via visual sensors or as a symbolic world representation referred to as a map. Previous work on map-based NL navigation relied on small artificial worlds with a fixed set of entities known in advance. Here we introduce the Realistic Urban Navigation (RUN) task, aimed at interpreting NL navigation instructions based… 

Sim-to-Real Transfer for Vision-and-Language Navigation

To bridge the gap between the high-level discrete action space learned by the VLN agent and the robot's low-level continuous action space, a subgoal model is proposed to identify nearby waypoints, and domain randomization is used to mitigate visual domain differences.

Visual-and-Language Navigation: A Survey and Taxonomy

This paper provides a comprehensive survey of VLN tasks and carefully classifies them according to the different characteristics of their language instructions, enabling researchers to better grasp the key point of a specific task and identify directions for future research.

Learning to Read Maps: Understanding Natural Language Instructions from Unseen Maps

This paper demonstrates that the gold GPS location can be accurately predicted from the natural language instruction and metadata with 72% accuracy for previously seen maps and 64% for unseen maps.

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

This paper reviews contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc., and highlights the limitations of current VLN and opportunities for future work.

Vision-Language Navigation: A Survey and Taxonomy

This paper provides a comprehensive survey and an insightful taxonomy of Vision-Language Navigation tasks based on the different characteristics of language instructions in these tasks, dividing the tasks into single-turn and multi-turn tasks.

Draw Me a Flower: Grounding Formal Abstract Structures Stated in Informal Natural Language

Results of the baseline models on an instruction-to-execution task derived from the HEXAGONS dataset confirm that higher-level abstractions in NL are indeed more challenging for current systems to process.

Visually Grounding Language Instruction for History-Dependent Manipulation

  • Hyemin Ahn, Obin Kwon, S. Oh
  • Computer Science
    2022 International Conference on Robotics and Automation (ICRA)
  • 2022
A history-dependent manipulation task whose objective is to visually ground a series of language instructions for proper pick-and-place manipulations by referring to the past; it is shown that a model trained on the proposed dataset can also be applied to the real world using CycleGAN.



TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

This work introduces the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object.

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

This work introduces a multi-level aligner that empowers the alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) to translate natural language instructions to action sequences based upon a representation of the observable world state.

Interpretation of Spatial Language in a Map Navigation Task

  • M. Levit, D. Roy
  • Computer Science
    IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)
  • 2007
This paper provides building blocks for a complete system that, when combined with robust parsing technologies, could lead to a fully automatic spatial language interpretation system.

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

This work provides the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset, and presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery.

Learning to Follow Navigational Directions

A system that learns to follow navigational natural-language directions by apprenticeship, learning from routes through a map paired with English descriptions; a reinforcement learning algorithm grounds the meaning of spatial terms like "above" and "south" in geometric properties of paths.

Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning

We introduce a method for following high-level navigation instructions by mapping directly from images, instructions, and pose estimates to continuous low-level velocity commands for real-time control.

Talk the Walk: Navigating New York City through Grounded Dialogue

This work focuses on the task of tourist localization and develops the novel Masked Attention for Spatial Convolutions (MASC) mechanism for grounding tourist utterances in the guide's map, showing that it yields significant improvements for both emergent and natural language communication.

Learning to Interpret Natural Language Navigation Instructions from Observations

A system that learns to transform natural-language navigation instructions into executable formal plans by using a learned lexicon to refine inferred plans and a supervised learner to induce a semantic parser.

Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions

MARCO, an agent that follows free-form, natural language route instructions by representing and executing a sequence of compound action specifications that model which actions to take under which conditions, is presented.

Vision-Based Navigation With Language-Based Assistance via Imitation Learning With Indirect Intervention

To model language-based assistance, a general framework termed Imitation Learning with Indirect Intervention (I3L) is developed, and a solution effective on the VNLA task is proposed that significantly improves the success rate of the learning agent over other baselines in both seen and unseen environments.