• Corpus ID: 231986261

On Proximal Policy Optimization's Heavy-tailed Gradients

Authors: Saurabh Kumar Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Sivaraman Balakrishnan, Zachary Chase Lipton, Ruslan Salakhutdinov, Pradeep Ravikumar
Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (“heavy-tailed”) regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that… 
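The two heuristics named in the abstract can be made concrete with a minimal sketch (my own illustration, not the authors' code; the clipping thresholds `eps` and `max_norm` are conventional defaults, not values from the paper):

```python
import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate loss.

    ratio     = pi_new(a|s) / pi_old(a|s), shape (batch,)
    advantage = estimated advantage A(s, a), shape (batch,)
    The probability ratio is clipped to [1 - eps, 1 + eps], and the
    pessimistic (elementwise minimum) surrogate is negated for minimization.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# The second heuristic, gradient clipping, is typically applied after
# loss.backward() by rescaling gradients to a maximum global norm:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Both operations truncate large values (of the ratio and of the gradient, respectively), which is what connects them to robust estimation in heavy-tailed regimes.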
On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control
Establishes, for the first time, how the convergence rate to stationarity depends on the policy's tail index α, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter, and corroborates that heavy-tailed parameterization also manifests in improved performance of policy search.
On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
Studies the convergence of policy gradient algorithms under heavy-tailed parameterizations, proposes stabilizing them with a combination of mirror ascent-type updates and gradient tracking, and demonstrates improved reward accumulation across a variety of settings compared with standard benchmarks.
Heavy-tailed Streaming Statistical Estimation
Designs a clipped stochastic gradient descent algorithm and provides an improved analysis under a more nuanced condition on the noise of the stochastic gradients, which is critical when analyzing stochastic optimization problems arising from general statistical estimation problems.
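The flavor of clipped-SGD estimation under heavy-tailed noise can be sketched on the simplest streaming problem, mean estimation (a toy of my own construction; the step-size schedule and clipping level are illustrative choices, not the paper's):

```python
import numpy as np

def clipped_sgd_mean(stream, lr=0.1, clip=1.0):
    """Estimate the mean of a (possibly heavy-tailed) stream by SGD on
    the squared loss 0.5 * (theta - x)^2, clipping each stochastic
    gradient to [-clip, clip] before the update."""
    theta = 0.0
    for t, x in enumerate(stream, start=1):
        g = theta - x                    # stochastic gradient
        g = np.clip(g, -clip, clip)      # truncate heavy-tailed outliers
        theta -= (lr / np.sqrt(t)) * g   # decaying step size
    return theta

# Usage: Student-t samples (df=2, infinite variance) centered at 3.0;
# despite the heavy tails, the clipped iterate lands near the true mean.
# rng = np.random.default_rng(0)
# est = clipped_sgd_mean(3.0 + rng.standard_t(2, size=5000))
```

Without clipping, a single extreme sample can throw the iterate far from the mean; truncation bounds each update's influence, which is the core robust-statistics idea the surrounding papers build on.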
Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
Real-data experiments on deep learning confirm the theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization, and the dynamics of truncated SGD driven by heavy-tailed noise are characterized.
HTRON: Efficient Outdoor Navigation with Sparse Rewards via Heavy Tailed Adaptive Reinforce Algorithm
This work proposes HTRON, a novel adaptive algorithm that utilizes heavy-tailed policy parametrizations to implicitly induce exploration in sparse-reward settings, and shows that the learned policy can be transferred directly onto a Clearpath Husky robot to perform outdoor terrain navigation in real-world scenarios.
A Closer Look at Deep Policy Gradients
A fine-grained analysis of state-of-the-art methods, based on key elements of the policy gradient framework (gradient estimation, value prediction, and optimization landscapes), shows that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict.
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
© ICLR 2019 - Conference Track Proceedings. Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major…
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks, and the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
Beyond variance reduction: Understanding the true impact of baselines on policy optimization
It is found that baselines modify the optimization dynamics even when the variance is the same, and a more careful treatment of stochasticity in the updates---beyond the immediate variance---is necessary to understand the optimization process of policy gradient algorithms.
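The baseline property at issue can be checked numerically with the score-function estimator: subtracting any action-independent baseline leaves the expected gradient unchanged while altering the update noise (a toy two-action softmax policy of my own construction, not from the paper):

```python
import numpy as np

def pg_estimate(theta, rewards, baseline, n=200_000, seed=0):
    """Monte Carlo REINFORCE gradient for a single-state, 2-action
    softmax policy with logits (theta, 0), with a constant baseline
    subtracted from the sampled reward."""
    rng = np.random.default_rng(seed)
    p = np.exp([theta, 0.0])
    p /= p.sum()
    a = rng.choice(2, size=n, p=p)
    # d/dtheta log pi(a): (1 - p[0]) for action 0, -p[0] for action 1
    score = np.where(a == 0, 1 - p[0], -p[0])
    return np.mean((rewards[a] - baseline) * score)

rewards = np.array([1.0, 0.0])
g0 = pg_estimate(0.3, rewards, baseline=0.0)
gb = pg_estimate(0.3, rewards, baseline=0.5)
# Both estimate the same true gradient p0*(1-p0)*(r0-r1); only the
# per-sample noise of the estimator differs.
```

The paper's observation is precisely that matching the expectation (and even the variance) is not the whole story: the baseline still shapes the stochastic optimization path.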
Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods
An empirical analysis of the effects that a wide range of gradient descent optimizers and their hyperparameters have on policy gradient methods, a subset of Deep RL algorithms, for benchmark continuous control tasks finds that adaptive optimizers have a narrow window of effective learning rates, diverging in other cases, and that the effectiveness of momentum varies depending on the properties of the environment.
Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods
This work analyzes the properties and drawbacks of previous control variate (CV) techniques, finds that they overlook the important fact that Monte Carlo gradient estimates are generated by trajectories of states and actions, and proposes a class of "trajectory-wise" CVs that are optimal for variance reduction under reasonable assumptions.
Implementation Matters in Deep RL: A Case Study on PPO and TRPO
The results show that algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm are responsible for most of PPO's gain in cumulative reward over TRPO, and fundamentally change how RL methods function.
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise
It is argued both theoretically and empirically that the heavy tails of such perturbations can result in a bias even when the step size is small, and a novel framework, coined Fractional Underdamped Langevin Dynamics (FULD), is developed and proved to target the so-called Gibbs distribution, whose optima exactly match the optima of the original cost.
Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
The significance of hyper-parameters in policy gradients for continuous control, general variance in the algorithms, and reproducibility of reported results are investigated, and guidelines are provided on reporting novel results as comparisons against baseline methods.