# Bayesian Policy Gradients via Alpha Divergence Dropout Inference

```bibtex
@article{Henderson2017BayesianPG,
  title   = {Bayesian Policy Gradients via Alpha Divergence Dropout Inference},
  author  = {Peter Henderson and Thang Van Doan and Riashat Islam and David Meger},
  journal = {ArXiv},
  year    = {2017},
  volume  = {abs/1712.02037}
}
```

Policy gradient methods have had great success in solving continuous control tasks, yet the stochastic nature of such problems makes deterministic value estimation difficult. We propose an approach which instead estimates a distribution by fitting the value function with a Bayesian Neural Network. We optimize an $\alpha$-divergence objective with Bayesian dropout approximation to learn and estimate this distribution. We show that using the Monte Carlo posterior mean of the Bayesian value…
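The Monte Carlo posterior mean mentioned in the abstract can be sketched as follows. This is a hypothetical NumPy illustration, not the paper's implementation: the two-layer value network, its sizes, and the dropout rate are all illustrative assumptions. The key idea is that dropout stays active at evaluation time, and the value estimate is the mean over stochastic forward passes, with the sample standard deviation serving as an uncertainty signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy value-network weights (state_dim -> hidden -> scalar value); sizes are
# illustrative, not the paper's architecture.
W1 = rng.normal(0, 0.1, size=(8, 32))
W2 = rng.normal(0, 0.1, size=(32, 1))

def value_mc_dropout(state, p_drop=0.1, n_samples=50):
    """Posterior mean of V(s) via Monte Carlo dropout sampling."""
    samples = []
    for _ in range(n_samples):
        h = np.maximum(state @ W1, 0.0)  # ReLU hidden layer
        # Dropout mask stays on at evaluation time (inverted-dropout scaling).
        mask = rng.binomial(1, 1.0 - p_drop, size=h.shape) / (1.0 - p_drop)
        samples.append(float((h * mask) @ W2))
    samples = np.array(samples)
    return samples.mean(), samples.std()  # mean value and uncertainty estimate

state = rng.normal(size=8)
v_mean, v_std = value_mc_dropout(state)
```

Each forward pass draws a fresh dropout mask, so the set of outputs approximates samples from the value posterior rather than a single deterministic estimate.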


## 14 Citations

NADPEx: An on-policy temporally consistent exploration method for deep reinforcement learning

- Computer Science, ICLR
- 2019

This work introduces a novel on-policy temporally consistent exploration strategy - Neural Adaptive Dropout Policy Exploration (NADPEx) - for deep reinforcement learning agents, modeled as a global random variable for conditional distribution.

Reward Estimation for Variance Reduction in Deep Reinforcement Learning

- Computer Science, CoRL
- 2018

Reward estimation is shown to be a robust, easy-to-implement improvement for handling corrupted reward signals in model-free RL, improving performance under corrupted stochastic rewards in both the tabular and non-linear function approximation settings.

Exploration by Distributional Reinforcement Learning

- Computer Science, IJCAI
- 2018

We propose a framework based on distributional reinforcement learning and recent attempts to combine Bayesian parameter updates with deep reinforcement learning. We show that our proposed framework…

The Potential of the Return Distribution for Exploration in RL

- Computer Science, ArXiv
- 2018

Combined with exploration policies that leverage the return distribution, the proposed approach solves, for example, a randomized Chain task of length 100, a result not previously reported when learning with neural networks.

Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning

- Computer Science, ArXiv
- 2019

This work presents a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies.

Deep Reinforcement Learning: Frontiers of Artificial Intelligence

- Computer Science, Springer Singapore
- 2019

TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions, and ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network.

A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges

- Computer Science, Inf. Fusion
- 2021

Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter

- Computer Science, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
- 2020

A novel Deep RL procedure is presented that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects.

Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning

- Computer Science, Comput. Biol. Medicine
- 2021

## References

Showing 1-10 of 20 references

Bayesian Policy Gradient and Actor-Critic Algorithms

- Computer Science, J. Mach. Learn. Res.
- 2016

A Bayesian framework for policy gradient is proposed, based on modeling the policy gradient as a Gaussian process, which reduces the number of samples needed to obtain accurate gradient estimates and provides estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Computer Science, NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Proximal Policy Optimization Algorithms

- Computer Science, ArXiv
- 2017

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…

Trust Region Policy Optimization

- Computer Science, ICML
- 2015

This work describes a method for optimizing control policies with guaranteed monotonic improvement; making several approximations to the theoretically justified scheme yields a practical algorithm, called Trust Region Policy Optimization (TRPO).

High-Dimensional Continuous Control Using Generalized Advantage Estimation

- Computer Science, ICLR
- 2016

This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
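The bias-variance trade-off described above can be made concrete with a minimal sketch of the paper's estimator, Generalized Advantage Estimation: advantages are exponentially weighted sums of TD residuals, with λ trading bias against variance. Function and variable names here are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` has one extra entry: the bootstrap V(s_T) for the final state
    (zero if the episode terminates).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals: A_t = delta_t + (gamma*lam) * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# With lam=0 the estimator collapses to the one-step TD residual;
# with lam=1 it becomes the Monte Carlo return minus the value baseline.
adv = gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
```

λ = 0 gives the lowest-variance, highest-bias estimate; λ = 1 gives the unbiased but high-variance Monte Carlo estimate.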

Improving PILCO with Bayesian Neural Network Dynamics Models

- Computer Science
- 2016

PILCO’s framework is extended to use Bayesian deep dynamics models with approximate variational inference, allowing PILCO to scale linearly with the number of trials and with observation-space dimensionality; it is also shown that moment matching is a crucial simplifying assumption made by the model.

Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks

- Computer Science, ICML
- 2015

This work presents a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP), which works by computing a forward propagation of probabilities through the network and then doing a backward computation of gradients.

A Distributional Perspective on Reinforcement Learning

- Computer Science, ICML
- 2017

This paper argues for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent, and designs a new algorithm which applies Bellman's equation to the learning of approximate value distributions.

Issues in Using Function Approximation for Reinforcement Learning

- Computer Science
- 1999

This paper gives a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail, and presents experimental results that support the theoretical findings.

Concrete Dropout

- Computer Science, NIPS
- 2017

This work proposes a new dropout variant that gives improved performance and better-calibrated uncertainties; it uses a continuous relaxation of dropout’s discrete masks to allow automatic tuning of the dropout probability in large models, enabling faster experimentation cycles.
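The continuous relaxation can be sketched in NumPy, assuming the Concrete-distribution form used for relaxed Bernoulli variables; a real implementation would live in an autodiff framework so the dropout rate `p` can be learned by gradient descent. The function name and temperature value here are illustrative assumptions.

```python
import numpy as np

def concrete_dropout_mask(p, shape, temperature=0.1, rng=None):
    """Continuous relaxation of a Bernoulli dropout mask.

    Returns soft keep-values in [0, 1] that are differentiable with respect
    to the dropout rate p, which is what makes p tunable by gradient descent.
    """
    rng = rng or np.random.default_rng()
    eps = 1e-7
    u = rng.uniform(eps, 1.0 - eps, size=shape)  # uniform noise
    # Relaxed Bernoulli(p) "drop" sample: sigmoid of Gumbel-style logits.
    drop_logit = (np.log(p + eps) - np.log(1.0 - p + eps)
                  + np.log(u) - np.log(1.0 - u)) / temperature
    drop_prob = 1.0 / (1.0 + np.exp(-drop_logit))  # soft "drop" indicator
    return 1.0 - drop_prob                         # soft "keep" mask

mask = concrete_dropout_mask(0.2, shape=(4,), rng=np.random.default_rng(0))
```

As the temperature approaches zero the soft mask concentrates near {0, 1} and recovers ordinary discrete dropout, while at nonzero temperature gradients can flow through the mask to the dropout rate.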