Corpus ID: 54436421

Discovering hierarchies using Imitation Learning from hierarchy aware policies

A. Deshpande, Harshavardhan P. K., and Balaraman Ravindran
Learning options that allow agents to exhibit temporally extended behavior has proven useful for increasing exploration, reducing sample complexity, and for various transfer scenarios. Discovery of Deep Options (DDO) is a generative algorithm that learns a hierarchical policy along with options directly from expert trajectories. We perform a qualitative and quantitative analysis of options inferred by DDO in different domains. To this end, we suggest different value metrics like… 
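For context, the "option" abstraction the abstract relies on can be sketched as a small data structure: an initiation set, an intra-option policy, and a stochastic termination condition. This is a minimal illustrative sketch of the standard options formalism, not the paper's implementation; all names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option in the Sutton-Precup-Singh sense: a temporally extended
    action with its own policy and termination condition."""
    initiation_set: Set[int]              # states where the option may start
    policy: Callable[[int], int]          # intra-option policy: state -> action
    termination: Callable[[int], float]   # beta(s): probability of stopping in s

def run_option(option, env_step, state, rng, max_steps=100):
    """Execute an option until it terminates; return the visited states."""
    trajectory = [state]
    for _ in range(max_steps):
        action = option.policy(state)
        state = env_step(state, action)
        trajectory.append(state)
        if rng.random() < option.termination(state):
            break
    return trajectory
```

A hierarchical policy of the kind DDO infers would then choose among such options (and primitive actions) rather than among primitive actions alone.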


Multi-Level Discovery of Deep Options
Presents Discovery of Deep Options (DDO), a policy-gradient algorithm that discovers parametrized options from a set of demonstration trajectories and can be applied recursively to discover additional levels of the hierarchy.
Hierarchical Imitation and Reinforcement Learning
This work proposes an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction and can incorporate different combinations of imitation learning and reinforcement learning at different levels, leading to dramatic reductions in both expert effort and cost of exploration.
Intra-Option Learning about Temporally Abstract Actions
This paper presents intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options, and sketches a convergence proof for intra-option value learning.
An Inference-Based Policy Gradient Method for Learning Options
This work develops a novel policy gradient method for the automatic learning of policies with options that uses inference methods to simultaneously improve all of the options available to an agent, and thus can be employed in an off-policy manner, without observing option labels.
Recent Advances in Hierarchical Reinforcement Learning
This work reviews several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed and discusses extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability.
The Option-Critic Architecture
This work derives policy gradient theorems for options and proposes a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals.
DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations
Presents DDCO together with a cross-validation method that relaxes DDO's requirement that users specify the number of options to be discovered; results suggest that DDCO can take 3x fewer demonstrations to achieve the same reward as a baseline imitation learning approach.
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
It is shown that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way, and may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
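As a toy illustration of using options interchangeably with primitive actions, an SMDP-style Q-learning update lets both share one Q-table, differing only in duration and accumulated reward. This is an illustrative sketch under assumed tabular settings; the function name and signature are not from the cited paper.

```python
import numpy as np

def smdp_q_update(Q, s, o, rewards, s_next, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after executing option o from state s.

    rewards: per-step rewards collected while o was executing
             (a single-element list for a primitive action).
    """
    k = len(rewards)  # option duration; k == 1 for a primitive action
    # Discounted return accumulated during the option's execution.
    R = sum(gamma**i * r for i, r in enumerate(rewards))
    # Bootstrap with gamma^k, since k steps elapsed before reaching s_next.
    target = R + gamma**k * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```

The only change from one-step Q-learning is discounting the bootstrap term by `gamma**k`, which is what makes options and primitive actions interchangeable in the same update rule.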
Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
h-DQN is presented, a framework to integrate hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning, and allows for flexible goal specifications, such as functions over entities and relations.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
This paper proposes a new iterative algorithm that trains a stationary deterministic policy and can be seen as a no-regret algorithm in an online learning setting, and demonstrates that this approach outperforms previous methods on two challenging imitation learning problems and a benchmark sequence-labeling problem.
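The iterative scheme this summary describes (DAgger) can be sketched in a few lines, assuming a queryable expert, a generic supervised `train` routine, and a `rollout` function; all names here are illustrative placeholders, not the paper's API.

```python
def dagger(expert, train, rollout, n_iters=5):
    """DAgger sketch: relabel states visited by the learned policy with
    expert actions, then retrain on the aggregated dataset each iteration."""
    dataset = []
    policy = expert  # iteration 0: roll out the expert itself
    for _ in range(n_iters):
        states = rollout(policy)                      # states the current policy visits
        dataset += [(s, expert(s)) for s in states]   # expert relabels those states
        policy = train(dataset)                       # supervised learning on aggregate
    return policy
```

The key point is that training states come from the learner's own rollouts rather than only from expert demonstrations, which is what yields the no-regret guarantee.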