Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

Kuan Fang, Patrick Yin, Ashvin Nair, Sergey Levine
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments. To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to…

Related papers
C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks
An algorithm that solves distant goal-reaching tasks by using search at training time to automatically generate a curriculum of intermediate states; it is able to solve very long-horizon manipulation and navigation tasks that prior goal-conditioned methods and methods based on graph search fail to solve.
Goal-Conditioned Reinforcement Learning with Imagined Subgoals
This work proposes to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks; evaluated on challenging robotic navigation and manipulation tasks, the approach outperforms existing methods by a large margin.
PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals
This work proposes PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards; experiments indicate that it can achieve comparable performance while being around 4-8 times more sample-efficient.
Planning with Goal-Conditioned Policies
This work shows that goal-conditioned policies learned with RL can be incorporated into planning, such that a planner can focus on which states to reach, rather than how those states are reached, and proposes using a latent variable model to compactly represent the set of valid states.
Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation
This work proposes hierarchical visual foresight (HVF), a framework for subgoal generation and planning that generates subgoal images conditioned on a goal image and uses them for planning; the method naturally identifies semantically meaningful states as subgoals.
Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
This work proposes learning a functional understanding of the environment by learning to reach any goal state in a given dataset, employing goal-conditioned Q-learning with hindsight relabeling, and develops several techniques that enable training in a particularly challenging offline setting.
Visual Reinforcement Learning with Imagined Goals
An algorithm that acquires general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies; it is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning
This work simplifies the long-horizon policy learning problem by using a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, where the low-level policy acts only for a fixed number of steps, regardless of the goal achieved.
Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors
By using both goal-conditioning and hierarchical prediction, GCPs make it possible to solve visual planning tasks with much longer horizons than previously possible, and they enable an effective hierarchical planning algorithm that optimizes trajectories in a coarse-to-fine manner.
Learning to Reach Goals via Iterated Supervised Learning
This paper proposes a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. It formally shows that this iterated supervised learning procedure optimizes a bound on the RL objective, derives performance bounds for the learned policy, and empirically demonstrates improved goal-reaching performance and robustness over current RL algorithms on several benchmark tasks.