Visual Semantic Planning Using Deep Successor Representations

Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Kumar Gupta, Roozbeh Mottaghi, Ali Farhadi
2017 IEEE International Conference on Computer Vision (ICCV)
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning… 
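The abstract frames each action in terms of its preconditions (what must hold before the action can run) and its effects (how the state changes afterward). As a point of reference for this classical view, here is a minimal STRIPS-style sketch in Python; all names (`Action`, `pick_mug`, the fact strings) are illustrative and are not taken from the paper's implementation, which learns these relationships from visual observations rather than hand-coding them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A symbolic action with preconditions and effects (STRIPS-style)."""
    name: str
    preconditions: frozenset  # facts that must hold for the action to apply
    add_effects: frozenset    # facts made true by the action
    del_effects: frozenset    # facts made false by the action

    def applicable(self, state: frozenset) -> bool:
        # The action can fire only when all its preconditions hold.
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        # Effect: delete the removed facts, then add the new ones.
        return (state - self.del_effects) | self.add_effects


# Hypothetical example: picking up a mug transforms the state.
pick_mug = Action(
    name="pick(mug)",
    preconditions=frozenset({"hand_empty", "mug_on_table"}),
    add_effects=frozenset({"holding_mug"}),
    del_effects=frozenset({"hand_empty", "mug_on_table"}),
)

state = frozenset({"hand_empty", "mug_on_table"})
if pick_mug.applicable(state):
    state = pick_mug.apply(state)
print(sorted(state))  # ['holding_mug']
```

In this framing, planning is a search over sequences of such transformations from the initial state to a goal state; the paper's contribution is learning this machinery from pixels instead of specifying it symbolically.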


Citations
What Should I Do Now? Marrying Reinforcement Learning and Symbolic Planning
HIP-RL is proposed, a method for merging the benefits and capabilities of Symbolic Planning with the learning abilities of Deep Reinforcement Learning, and applied to the complex visual tasks of interactive question answering and visual semantic planning.
Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks
A novel memory-based policy, named Scene Memory Transformer (SMT), which embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies.
Visual Search and Recognition for Robot Task Execution and Monitoring
A vision-based execution monitoring, which uses classical planning as a backbone for task execution and takes advantage of a deep convolutional network to detect objects and relevant relations holding between them is introduced.
Explore, Approach, and Terminate: Evaluating Subtasks in Active Visual Object Search Based on Deep Reinforcement Learning
This work proposes a reinforcement learning solution to the active visual object search problem that successfully learns to explore the environment, to approach the target object, and to decide when to terminate the search once the target object has been found.
Deep Reinforcement Learning for Visual Semantic Navigation with Memory
  I. B. D. A. Santos, R. Romero. 2020 Latin American Robotics Symposium (LARS), 2020 Brazilian Symposium on Robotics (SBR) and 2020 Workshop on Robotics in Education (WRE), 2020.
The effects of adding recurrent networks to a learning-based navigation model are investigated, showing that memory of past experiences makes it possible to learn better policies than memoryless models.
Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
This work empirically demonstrates that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases, and suggests contextualized language models may provide strong planning modules for grounded virtual agents.
Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships
This paper investigates target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes, where the task is to train an agent that can make a series of decisions to reach a pre-specified target location from any possible starting position using only egocentric views.
Towards Generalization in Target-Driven Visual Navigation by Using Deep Reinforcement Learning
This article proposes a novel architecture composed of two networks, both trained exclusively in simulation; the networks are designed to work together yet trained separately, which helps generalization in target-driven visual navigation.
Deep Learning for Embodied Vision Navigation: A Survey
This paper presents a comprehensive review of embodied navigation tasks and the recent progress in deep learning based methods, covering two major tasks: target-oriented navigation and instruction-oriented navigation.
Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following
A model that factorizes interactive perception and action policy into separate streams within a unified end-to-end framework is designed, outperforming the previous challenge-winning method.

References
Target-driven visual navigation in indoor scenes using deep reinforcement learning
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, allowing better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
The Curious Robot: Learning Visual Representations via Physical Interactions
This work builds one of the first systems on a Baxter platform that pushes, pokes, grasps, and observes objects in a tabletop environment, with each datapoint providing supervision to a shared ConvNet architecture that learns visual representations.
Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs
This work introduces an extension to the Generative Adversarial Imitation Learning method that can infer the latent structure of human decision-making in an unsupervised way and can not only imitate complex behaviors, but also learn interpretable and meaningful representations.
Actions ~ Transformations
A novel representation for actions is proposed by modeling an action as a transformation which changes the state of the environment before the action happens (precondition) to the state after the action (effect).
Learning to Poke by Poking: Experiential Learning of Intuitive Physics
A novel approach based on deep neural networks is proposed for modeling the dynamics of a robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics.
End-to-End Training of Deep Visuomotor Policies
This paper develops a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors, trained using a partially observed guided policy search method, with supervision provided by a simple trajectory-centric reinforcement learning method.
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Learning to Act by Predicting the Future
The presented approach utilizes a high-dimensional sensory stream and a lower-dimensional measurement stream that provides a rich supervisory signal, which enables training a sensorimotor control model by interacting with the environment.
Learning to Perform Physics Experiments via Deep Reinforcement Learning
This work introduces a basic set of tasks that require agents to estimate properties such as mass and cohesion of objects in an interactive simulated environment where they can manipulate the objects and observe the consequences.
Combined task and motion planning through an extensible planner-independent interface layer
This work proposes a new approach that uses off-the-shelf task planners and motion planners and makes no assumptions about their implementation; it relies on a novel representational abstraction that requires only that failures in computing a motion plan for a high-level action be identifiable and expressible as logical predicates at the task level.