Corpus ID: 221376540

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux
Policy gradient methods are a popular and effective choice for training reinforcement learning agents in complex environments. The variance of the stochastic policy gradient is often seen as a key quantity determining the effectiveness of the algorithm. Baselines are a common addition for reducing this variance, but previous work has rarely considered other effects baselines may have on the optimization process. Using simple examples, we find that baselines modify the…
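The abstract's central object, the score-function gradient with a baseline subtracted from the return, can be illustrated with a toy example. The sketch below is not from the paper; the two-armed bandit, the softmax policy, and all names are assumptions for illustration. It checks empirically that subtracting the average reward as a baseline leaves the gradient estimate unbiased while reducing its variance:

```python
import math
import random

random.seed(0)

# Hypothetical two-armed bandit: fixed rewards per action, and a softmax
# policy over the two actions parameterized by a single logit theta.
rewards = [1.0, 0.0]

def p_action0(theta):
    """Probability of action 0 under a softmax over logits [theta, 0]."""
    e = math.exp(theta)
    return e / (e + 1.0)

def grad_sample(theta, baseline):
    """One score-function gradient sample: (R - b) * d/dtheta log pi(a)."""
    p = p_action0(theta)
    a = 0 if random.random() < p else 1
    score = (1.0 - p) if a == 0 else -p  # derivative of log pi(a) w.r.t. theta
    return (rewards[a] - baseline) * score

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

theta, n = 0.0, 100_000
p = p_action0(theta)
avg_reward = p * rewards[0] + (1.0 - p) * rewards[1]  # E[R] = 0.5

no_baseline = [grad_sample(theta, 0.0) for _ in range(n)]
with_baseline = [grad_sample(theta, avg_reward) for _ in range(n)]

# Both estimators have the same mean (~0.25); the baseline cuts the variance.
print(variance(no_baseline), variance(with_baseline))
```

In this toy case, subtracting E[R] leaves the estimator's mean unchanged while collapsing its variance; the paper's thesis is that baseline choices can also alter the optimization process beyond variance alone.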
Coordinate-wise Control Variates for Deep Policy Gradients
Presents experimental evidence that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline, and shows that the resulting algorithm, with proper regularization, can achieve higher sample efficiency than scalar control variates on continuous control benchmarks.
Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits
This work proposes an adaptation of Knowledge Infused Policy Gradients to the contextual bandit setting and a novel Knowledge Infused Upper Confidence Bound algorithm, and performs an experimental analysis on a simulated music recommendation dataset and various real-life datasets, identifying where expert knowledge can drastically reduce the total regret and where it cannot.
On Proximal Policy Optimization's Heavy-tailed Gradients
This paper proposes incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks, and finds that this method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.


Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
This paper considers variance reduction methods originally developed for Monte Carlo estimates of integrals, and gives bounds on the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system.
Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods
This work analyzes the properties and drawbacks of previous CV techniques, observing that they overlook the important fact that Monte Carlo gradient estimates are generated by trajectories of states and actions, and proposes a class of "trajectory-wise" CVs that are optimal for variance reduction under reasonable assumptions.
The Optimal Reward Baseline for Gradient-Based Reinforcement Learning
This work incorporates a reward baseline into the learning system, shows that it affects variance without introducing further bias, and finds that the optimal constant reward baseline is equal to the long-term average expected reward.
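This claim about the variance-minimizing constant baseline can be checked numerically in a small setting. The sketch below is a hypothetical two-armed softmax bandit, not from the paper: it computes the exact variance of the baselined score-function estimator over a grid of constant baselines. For this symmetric policy the squared score magnitudes coincide across actions, so the minimizer lands exactly on the average expected reward:

```python
import math

# Hypothetical two-armed bandit: softmax policy over logits [theta, 0].
rewards = [1.0, 0.0]
theta = 0.0
p = math.exp(theta) / (math.exp(theta) + 1.0)  # P(action 0)
probs = [p, 1.0 - p]
scores = [1.0 - p, -p]  # d/dtheta log pi(a) for a = 0 and a = 1

def estimator_variance(b):
    """Exact variance of (R - b) * score under the policy distribution."""
    mean = sum(q * (r - b) * s for q, r, s in zip(probs, rewards, scores))
    second = sum(q * ((r - b) * s) ** 2 for q, r, s in zip(probs, rewards, scores))
    return second - mean ** 2

avg_reward = sum(q * r for q, r in zip(probs, rewards))
best = min((i / 100.0 for i in range(101)), key=estimator_variance)
print(best, avg_reward)  # the grid minimizer coincides with the average reward
```

Note that in general the exact variance-minimizing constant baseline is a score-weighted average of rewards; the equality with the plain average reward holds here because the two actions have equal squared scores.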
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
This work introduces the Policy Cover-Policy Gradient algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover), and complements the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major…
Neural Replicator Dynamics: Multiagent Learning via Hedging Policy Gradients
An elegant one-line change to policy gradient methods is derived that simply bypasses the gradient step through the softmax, yielding a new algorithm titled Neural Replicator Dynamics (NeuRD), which quickly adapts to nonstationarities and significantly outperforms policy gradient in both tabular and function approximation settings.
Eligibility Traces for Off-Policy Policy Evaluation
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm, a Monte Carlo method, was previously known, and analyzes and compares it with four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
Natural actor-critic algorithms
Four new reinforcement learning algorithms based on actor-critic, natural-gradient, and function-approximation ideas are presented, together with the first convergence proofs and the first fully incremental algorithms of this kind.
Safe and Efficient Off-Policy Reinforcement Learning
A novel algorithm, Retrace($\lambda$), is derived, believed to be the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration).