Corpus ID: 28172764

Bayesian Policy Gradients via Alpha Divergence Dropout Inference

Peter Henderson, Thang Van Doan, Riashat Islam, David Meger
Policy gradient methods have had great success in solving continuous control tasks, yet the stochastic nature of such problems makes deterministic value estimation difficult. We propose an approach which instead estimates a distribution by fitting the value function with a Bayesian Neural Network. We optimize an $\alpha$-divergence objective with Bayesian dropout approximation to learn and estimate this distribution. We show that using the Monte Carlo posterior mean of the Bayesian value… 
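As a rough illustration of the core idea (a sketch, not the authors' implementation), the Monte Carlo posterior mean can be obtained by keeping dropout active at evaluation time and averaging several stochastic forward passes through the value network; the layer sizes, dropout rate, and sample count below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny value network: state (dim 4) -> hidden (dim 16) -> scalar value.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def value_sample(state, p_drop=0.1):
    """One stochastic forward pass with a fresh dropout mask."""
    h = np.maximum(state @ W1, 0.0)      # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # Bernoulli dropout mask, active at eval time
    h = h * mask / (1.0 - p_drop)        # inverted-dropout scaling
    return (h @ W2).item()

def mc_posterior_value(state, n_samples=100):
    """Monte Carlo posterior mean and std of the value estimate."""
    samples = np.array([value_sample(state) for _ in range(n_samples)])
    return samples.mean(), samples.std()

state = rng.normal(size=4)
mean_v, std_v = mc_posterior_value(state)
```

The posterior standard deviation comes for free from the same samples, which is what makes the distributional view useful for downstream variance reduction.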

Citations

NADPEx: An on-policy temporally consistent exploration method for deep reinforcement learning
This work introduces Neural Adaptive Dropout Policy Exploration (NADPEx), a novel on-policy temporally consistent exploration strategy for deep reinforcement learning agents, modeled as a global random variable for the conditional distribution, equipping agents with inherent temporal consistency even when reward signals are sparse.
Reward Estimation for Variance Reduction in Deep Reinforcement Learning
The use of reward estimation is a robust and easy-to-implement improvement for handling corrupted reward signals in model-free RL and improves performance under corrupted stochastic rewards in both the tabular and non-linear function approximation settings.
Exploration by Distributional Reinforcement Learning
We propose a framework based on distributional reinforcement learning and recent attempts to combine Bayesian parameter updates with deep reinforcement learning. We show that our proposed framework…
The Potential of the Return Distribution for Exploration in RL
Combined with exploration policies that leverage this return distribution, this paper solves, for example, a randomized Chain task of length 100, which has not been reported before when learning with neural networks.
Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning
This work presents a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies.
Deep Reinforcement Learning: Frontiers of Artificial Intelligence
This work proposes TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions, and ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network.
Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter
A novel Deep RL procedure is presented that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects.


References

Bayesian Policy Gradient and Actor-Critic Algorithms
A Bayesian framework for policy gradient is proposed, based on modeling the policy gradient as a Gaussian process, which reduces the number of samples needed to obtain accurate gradient estimates and provides estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
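PPO's clipped surrogate objective can be sketched as follows (a minimal illustration; the epsilon value and the sample inputs are assumptions, not from this summary):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """PPO's clipped surrogate objective over a batch of transitions."""
    ratio = np.exp(logp_new - logp_old)                 # probability ratio r_t
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)  # limit policy movement
    # Take the pessimistic (elementwise minimum) of clipped and unclipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies coincide, the ratio is 1 everywhere and the objective reduces to the mean advantage; the clip only binds once the policy moves outside the [1 − ε, 1 + ε] trust band.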
Trust Region Policy Optimization
A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
High-Dimensional Continuous Control Using Generalized Advantage Estimation
This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
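The bias–variance trade-off described above can be sketched as an exponentially weighted sum of TD residuals (a hedged illustration; the γ and λ defaults below are common choices, not values taken from the paper):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has length len(rewards) + 1, ending with a bootstrap value.
    lam=0 gives low-variance, biased one-step TD residuals;
    lam=1 gives unbiased, high-variance Monte Carlo advantages.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):               # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

r = np.array([1.0, 0.0, 1.0])
v = np.array([0.5, 0.5, 0.5, 0.0])  # includes terminal bootstrap value
advantages = gae(r, v)
```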
Improving PILCO with Bayesian Neural Network Dynamics Models
PILCO’s framework is extended to use Bayesian deep dynamics models with approximate variational inference, allowing PILCO to scale linearly with the number of trials and observation space dimensionality, and it is shown that moment matching is a crucial simplifying assumption made by the model.
Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks
This work presents a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP), which works by computing a forward propagation of probabilities through the network and then doing a backward computation of gradients.
A Distributional Perspective on Reinforcement Learning
This paper argues for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent, and designs a new algorithm which applies Bellman's equation to the learning of approximate value distributions.
Issues in Using Function Approximation for Reinforcement Learning
This paper gives a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail, and presents experimental results that support the theoretical findings.
Concrete Dropout
This work proposes a new dropout variant that gives improved performance and better-calibrated uncertainties, using a continuous relaxation of dropout’s discrete masks to allow automatic tuning of the dropout probability in large models and, as a result, faster experimentation cycles.
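A minimal sketch of such a continuous relaxation (illustrative, not the paper's code; the temperature value and logit clipping are assumptions made here for numerical convenience):

```python
import numpy as np

def concrete_dropout_mask(shape, p_drop, temperature=0.1, rng=None):
    """Continuous (Concrete/Gumbel-style) relaxation of a Bernoulli dropout mask.

    Because the mask is a smooth function of p_drop, the drop probability
    can be tuned by gradient descent instead of grid search.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-7, 1 - 1e-7, size=shape)  # uniform noise per unit
    logit = (np.log(p_drop) - np.log(1 - p_drop)
             + np.log(u) - np.log(1 - u))
    a = np.clip(logit / temperature, -60.0, 60.0)  # avoid exp overflow
    z = 1.0 / (1.0 + np.exp(-a))                   # soft "drop" indicator in (0, 1)
    return 1.0 - z                                 # soft keep-mask
```

As the temperature approaches zero the mask approaches a hard Bernoulli mask, so the expected fraction of kept units tends to 1 − p_drop.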