Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

@inproceedings{Fang2019SceneMT,
  title={Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks},
  author={Kuan Fang and Alexander Toshev and Li Fei-Fei and Silvio Savarese},
  booktitle={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019},
  pages={538-547}
}
Many robotic applications require the agent to perform long-horizon tasks in partially observable environments, where decision making at any step can depend on observations received far in the past. The proposed Scene Memory Transformer (SMT) policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. The model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a clear margin.
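To make the mechanism concrete, below is a minimal sketch of a scene-memory attention policy in the spirit of SMT; the module sizes, the single attention layer, and all names are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal scene-memory attention policy sketch (illustrative, not the
# exact SMT architecture): embed each observation, append it to a
# growing memory, and attend over the memory to pick an action.
import torch
import torch.nn as nn

class SceneMemoryPolicy(nn.Module):
    def __init__(self, obs_dim=512, embed_dim=128, num_actions=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)   # observation encoder
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                          batch_first=True)
        self.policy_head = nn.Linear(embed_dim, num_actions)
        self.memory = []                             # embedded past observations

    def forward(self, obs):
        # Embed the current observation and add it to the scene memory.
        e = self.embed(obs).unsqueeze(1)             # (B, 1, D)
        self.memory.append(e)
        mem = torch.cat(self.memory, dim=1)          # (B, T, D)
        # Attend from the current embedding over the whole memory to
        # exploit spatio-temporal dependencies across the episode.
        ctx, _ = self.attn(e, mem, mem)
        return self.policy_head(ctx.squeeze(1))      # action logits

policy = SceneMemoryPolicy()
logits = policy(torch.randn(1, 512))                 # one step of an episode
```

Because the memory grows by only one embedding per step and attention handles arbitrary lengths, a policy of this shape can be trained with standard RL over long episodes, which is the property the abstract highlights.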

Citations

Learning to Navigate in Interactive Environments with the Transformer-based Memory

TLDR
This work proposes a surrogate objective of predicting the next waypoint, which facilitates representation learning and bootstraps the RL, and shows a significant improvement on the interactive Gibson benchmark over a recurrent RL policy, both in validation seen scenes and test unseen scenes.
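As a rough illustration of such a surrogate objective, the sketch below adds a hypothetical next-waypoint regression head on top of the policy features; the head, the MSE loss, and the weighting are assumptions, not the paper's exact design.

```python
# Hypothetical auxiliary waypoint-prediction objective added to an RL loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, coord_dim = 128, 2
waypoint_head = nn.Linear(feat_dim, coord_dim)  # shares features with the policy

def total_loss(rl_loss, features, next_waypoint, aux_weight=0.1):
    # Surrogate objective: regress the next waypoint from policy features;
    # the auxiliary term shapes the representation and bootstraps the RL.
    aux = F.mse_loss(waypoint_head(features), next_waypoint)
    return rl_loss + aux_weight * aux

feats = torch.randn(8, feat_dim)
loss = total_loss(torch.tensor(0.5), feats, torch.randn(8, coord_dim))
```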

History Aware Multimodal Transformer for Vision-and-Language Navigation

TLDR
A History Aware Multimodal Transformer (HAMT) is introduced to incorporate a long-horizon history into multimodal decision making for vision-and-language navigation and achieves new state of the art on a broad range of VLN tasks.

Memory-Augmented Reinforcement Learning for Image-Goal Navigation

TLDR
This work presents a memory-augmented approach for image-goal navigation based on an attention-based end-to-end model that leverages an episodic memory to learn to navigate, and establishes a new state of the art on the challenging Gibson dataset.

Structured Scene Memory for Vision-Language Navigation

TLDR
This work proposes Structured Scene Memory (SSM), an architecture designed to accurately memorize the percepts during navigation and to serve as a structured scene representation that captures and disentangles visual and geometric cues in the environment.

Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation

TLDR
This work shows that learning to estimate metrics that quantify the spatial relationship between an agent at a given location and a goal to reach has a strong positive impact in Multi-Object Navigation settings and significantly improves the performance of different baseline agents.

Learning Composable Behavior Embeddings for Long-Horizon Visual Navigation

TLDR
This work proposes Composable Behavior Embedding (CBE), a continuous behavior representation for long-horizon visual navigation that can be used to perform memory-efficient path following and topological mapping, requiring more than an order of magnitude less memory than behavior-less approaches.

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

TLDR
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON and improves the success rate on the fine-grained VLN benchmark R2R.

A Few Shot Adaptation of Visual Navigation Skills to New Observations using Meta-Learning

TLDR
This paper designs a policy architecture with latent features between the perception and inference networks, quickly adapts the perception network via meta-learning while freezing the inference network, and introduces a learning algorithm that enables rapid adaptation to new sensor configurations or target objects with a few shots.

MemoNav: Selecting Informative Memories for Visual Navigation

TLDR
The experimental results show that MemoNav outperforms the SoTA methods by a large margin while using a smaller fraction of the navigation history, and empirically show that the model is less likely to be trapped in a deadlock, which further validates that MemoNav improves the agent’s navigation efficiency by reducing redundant steps.

Learning Object-conditioned Exploration using Distributed Soft Actor Critic

TLDR
This work presents a highly scalable implementation of an off-policy reinforcement learning algorithm, distributed Soft Actor-Critic, which allows the system to utilize 98M experience steps in 24 hours on 8 GPUs and learns to control a differential-drive mobile base in simulation from a stack of high-dimensional observations commonly used on robotic platforms.
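For reference, the entropy-regularized bootstrap target that Soft Actor-Critic optimizes fits in a few lines; the clipped double-Q form below is the standard formulation, and the distributed actor/learner machinery the work scales up is omitted.

```python
# Standard SAC critic target: r + gamma * (min(Q1', Q2') - alpha * log pi).
import torch

def critic_target(reward, done, next_q1, next_q2, next_logp,
                  gamma=0.99, alpha=0.2):
    # Soft value: pessimistic target-Q estimate plus the entropy bonus.
    next_v = torch.min(next_q1, next_q2) - alpha * next_logp
    # Bootstrap only through non-terminal transitions.
    return reward + gamma * (1.0 - done) * next_v

y = critic_target(torch.tensor(1.0), torch.tensor(0.0),
                  torch.tensor(2.0), torch.tensor(1.8), torch.tensor(-0.5))
```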
...

References

Showing 1-10 of 56 references

Visual Semantic Planning Using Deep Successor Representations

TLDR
This work addresses the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.

Memory Augmented Control Networks

TLDR
It is shown that the Memory Augmented Control Network learns to plan and can generalize to new environments; the network is evaluated in discrete grid-world environments for path planning in the presence of simple and complex obstacles.

Target-driven visual navigation in indoor scenes using deep reinforcement learning

TLDR
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization and proposes the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.

Neural SLAM: Learning to Explore with External Memory

TLDR
This work embeds procedures mimicking that of traditional Simultaneous Localization and Mapping (SLAM) into the soft attention based addressing of external memory architectures, in which the external memory acts as an internal representation of the environment.
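The soft attention based addressing mentioned here can be illustrated with a generic content-based read over external memory slots; the cosine similarity and sharpness parameter below are common choices in this family of models, not necessarily this paper's.

```python
# Generic content-based soft read over an external memory.
import torch
import torch.nn.functional as F

def soft_read(memory, key, beta=5.0):
    """memory: (N, D) slots; key: (D,) query from the controller."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # (N,)
    weights = F.softmax(beta * sim, dim=0)   # soft address over slots
    return weights @ memory                  # blended readout, (D,)

memory = torch.randn(16, 32)                 # 16 slots of 32-d features
readout = soft_read(memory, torch.randn(32))
```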

Control of Memory, Active Perception, and Action in Minecraft

TLDR
These tasks are designed to emphasize, in a controllable manner, issues that pose challenges for RL methods including partial observability, delayed rewards, high-dimensional visual observations, and the need to use active perception in a correct manner so as to perform well in the tasks.

Neural Map: Structured Memory for Deep Reinforcement Learning

TLDR
This paper develops a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with and demonstrates empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments.
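As a sketch of what a structured, position-indexed write can look like, the snippet below blends a new feature into the 2D memory cell at the agent's location; the gated blend is an assumption for illustration, not Neural Map's exact write operator.

```python
# Illustrative write into a 2D spatial memory at the agent's position.
import torch
import torch.nn as nn

class SpatialMemory(nn.Module):
    def __init__(self, feat_dim=32, height=16, width=16):
        super().__init__()
        self.map = torch.zeros(height, width, feat_dim)  # structured 2D memory
        self.gate = nn.Linear(2 * feat_dim, feat_dim)    # learned write gate

    def write(self, xy, feature):
        x, y = xy
        old = self.map[y, x]
        # Gated blend of the old cell content with the new feature.
        g = torch.sigmoid(self.gate(torch.cat([old, feature])))
        self.map[y, x] = g * feature + (1 - g) * old

mem = SpatialMemory()
mem.write((3, 7), torch.randn(32))
```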

Cognitive Mapping and Planning for Visual Navigation

TLDR
The Cognitive Mapper and Planner is based on a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and on a spatial memory with the ability to plan given an incomplete set of observations about the world.

Learning to Navigate in Complex Environments

TLDR
This work considers jointly learning the goal-driven reinforcement learning problem with auxiliary depth prediction and loop closure classification tasks and shows that data efficiency and task performance can be dramatically improved by relying on additional auxiliary tasks leveraging multimodal sensory inputs.

Visual Representations for Semantic Target Driven Navigation

TLDR
This work proposes to use semantic segmentation and detection masks as observations obtained by state-of-the-art computer vision algorithms and use a deep network to learn navigation policies on top of representations that capture spatial layout and semantic contextual cues.
...