On the Role of Weight Sharing During Deep Option Learning

@inproceedings{Riemer2020OnTR,
  title={On the Role of Weight Sharing During Deep Option Learning},
  author={Matthew Riemer and Ignacio Cases and Clemens Rosenbaum and Miao Liu and Gerald Tesauro},
  booktitle={AAAI},
  year={2020}
}
The options framework is a popular approach for building temporally extended actions in reinforcement learning. In particular, the option-critic architecture provides general-purpose policy gradient theorems for learning such actions from scratch. However, past work makes the key assumption that each of the components of option-critic has independent parameters. In this work we note that while this key assumption of the policy gradient theorems of option-critic holds in the…
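As a rough illustration of the setting described above, the following is a minimal, hypothetical PyTorch sketch (the name SharedOptionCritic, the layer sizes, and the head layout are assumptions for illustration, not the authors' code) of an option-critic style network in which the intra-option policies, termination functions, and policy over options all read from one shared feature extractor, so a gradient through any head also updates parameters used by the others:

# Hypothetical sketch, not the paper's implementation: an option-critic style
# network with a shared trunk. Because every component depends on the trunk's
# parameters, the independent-parameters assumption behind the original
# option-critic policy gradient theorems no longer holds.
import torch
import torch.nn as nn

class SharedOptionCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, n_options, hidden=64):
        super().__init__()
        # Shared feature extractor used by every component below.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # One intra-option policy head and one termination head per option,
        # plus a single policy-over-options head.
        self.intra_option = nn.Linear(hidden, n_options * n_actions)
        self.termination = nn.Linear(hidden, n_options)
        self.over_options = nn.Linear(hidden, n_options)
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, obs):
        z = self.trunk(obs)
        pi_w = self.intra_option(z).view(-1, self.n_options, self.n_actions)
        return (
            torch.softmax(pi_w, dim=-1),                  # pi(a | s, w) for each option w
            torch.sigmoid(self.termination(z)),           # beta(s, w)
            torch.softmax(self.over_options(z), dim=-1),  # pi_Omega(w | s)
        )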
Context-Specific Representation Abstraction for Deep Option Learning
TLDR
This paper introduces Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space.
Hierarchical Average Reward Policy Gradient Algorithms
TLDR
This work extends the hierarchical option-critic policy gradient theorem to the average reward criterion and proves that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one.
Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)
TLDR
This work extends the hierarchical option-critic policy gradient theorem to the average reward criterion and proves that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one.
Continual Learning In Environments With Polynomial Mixing Times
TLDR
This work establishes that scalable MDPs have mixing times that scale polynomially with the size of the problem, and proposes a family of model-based algorithms that speed up learning by directly optimizing for the average reward through a novel bootstrapping procedure.
Parameter Sharing in Coagent Networks
TLDR
This work generalizes the Coagent Network Policy Gradient Theorem to the setting where parameters are shared among the function approximators involved, providing the theoretical foundation to use any pattern of parameter sharing and to leverage the freedom in the graph structure of the network to exploit relational bias in a given task.
Towards Continual Reinforcement Learning: A Review and Perspectives
TLDR
This review provides a taxonomy of different continual RL formulations, mathematically characterizes the non-stationary dynamics of each setting, and gives an overview of benchmarks used in the literature and important metrics for understanding agent performance.
Reinforcement Learning from a Mixture of Interpretable Experts
TLDR
The main technical contribution of the paper is to address the challenges introduced by the non-differentiable prototypical-state selection procedure and to show that the proposed algorithm can learn compelling policies on continuous action deep RL benchmarks, matching the performance of neural network based policies while returning policies that are more amenable to human inspection than neural network or linear-in-features policies.

References

Learning Abstract Options
TLDR
This work extends results from (Bacon et al., 2017), derives policy gradient theorems for a deep hierarchy of options, and proposes a hierarchical option-critic architecture capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals.
Unified Inter and Intra Options Learning Using Policy Gradient Methods
TLDR
This paper proposes a modular parameterization of intra-option policies together with option termination conditions and the option selection policy (inter options), and shows that these three decision components may be viewed as a unified policy over an augmented state-action space, to which standard policy gradient algorithms may be applied.
Conditional Computation in Neural Networks for faster models
TLDR
This paper applies a policy gradient algorithm to learn policies that optimize this loss function, proposes a regularization mechanism that encourages diversification of the dropout policy, and presents encouraging empirical results showing that this approach improves the speed of computation without impacting the quality of the approximation.
The Option-Critic Architecture
TLDR
This work derives policy gradient theorems for options and proposes a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals.
Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference
TLDR
This work proposes a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples, and introduces a new algorithm, Meta-Experience Replay, that directly exploits this view by combining experience replay with optimization based meta-learning.
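The transfer/interference trade-off mentioned in this summary is typically expressed through gradient dot products; a brief restatement in standard notation (a sketch of the idea, not a quotation from the paper):

% For two examples (x_i, y_i) and (x_j, y_j) with loss L and shared parameters theta,
% learning on one example transfers to the other when their gradients agree and
% interferes when they conflict:
\[
  \frac{\partial L(x_i, y_i)}{\partial \theta} \cdot
  \frac{\partial L(x_j, y_j)}{\partial \theta} > 0 \;\; \text{(transfer)},
  \qquad
  \frac{\partial L(x_i, y_i)}{\partial \theta} \cdot
  \frac{\partial L(x_j, y_j)}{\partial \theta} < 0 \;\; \text{(interference)}.
\]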
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
TLDR
It is shown that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way, and may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
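For reference, an option in this framework is conventionally defined as a triple; a minimal statement of that standard definition (notation assumed, not quoted from the paper):

% An option w consists of an initiation set I_w (states where w may be invoked),
% an intra-option policy pi_w followed while w is active, and a termination
% function beta_w giving the probability that w ends in a given state.
\[
  w = \langle I_w,\; \pi_w,\; \beta_w \rangle, \qquad
  I_w \subseteq \mathcal{S}, \quad
  \pi_w : \mathcal{S} \times \mathcal{A} \to [0,1], \quad
  \beta_w : \mathcal{S} \to [0,1].
\]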
Eligibility Traces for Off-Policy Policy Evaluation
TLDR
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
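The importance-sampling idea this summary refers to can be stated compactly; a sketch of the ordinary per-trajectory estimator in standard notation (not quoted from the paper), where trajectories are generated by a behavior policy b and reweighted to estimate the value of a target policy pi:

% Each trajectory tau_i = (s_0, a_0, ..., s_{T_i}) is collected by following b;
% its return G_i is reweighted by the likelihood ratio of pi to b.
\[
  \hat{V}^{\pi}(s_0) \;=\; \frac{1}{n} \sum_{i=1}^{n}
    \Bigg( \prod_{t=0}^{T_i - 1}
      \frac{\pi\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}
    \Bigg)\, G_i .
\]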
Data-Efficient Hierarchical Reinforcement Learning
TLDR
This paper studies how to develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control.
Learning from Scarce Experience
TLDR
This work proposes a family of algorithms based on likelihood ratio estimation that use data gathered while executing one policy (or collection of policies) to estimate the value of a different policy, shows positive empirical results, and provides a sample complexity bound.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.