Corpus ID: 31442909

Approximately Optimal Approximate Reinforcement Learning

@inproceedings{Kakade2002Approximately,
  title={Approximately Optimal Approximate Reinforcement Learning},
  author={Sham M. Kakade and John Langford},
  booktitle={Proceedings of the Nineteenth International Conference on Machine Learning (ICML)},
  year={2002}
}
Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach
This paper presents a study of the policy improvement step widely used by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step is performed inexactly, monotonic improvement of the policy sequence is no longer guaranteed.
You May Not Need Ratio Clipping in PPO
This paper proposes ESPO, which replaces PPO's ratio clipping with an early-stopping rule, and shows across many continuous control tasks that ESPO significantly outperforms PPO and can be easily scaled up to distributed training with many workers, delivering strong performance.
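As a rough illustration of the contrast this snippet draws, here is a minimal sketch of PPO's clipped surrogate next to an ESPO-style early-stopping check on the importance ratios. The function names, the threshold `delta`, and the mean-absolute-deviation rule are illustrative assumptions, not the paper's exact formulation:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO's clipped surrogate for one sample:
    # min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def espo_should_stop(ratios, delta=0.25):
    # ESPO-style rule (sketch): instead of clipping each ratio, stop
    # the update epoch once the batch's mean absolute ratio deviation
    # |r - 1| exceeds a threshold delta.
    mean_dev = sum(abs(r - 1.0) for r in ratios) / len(ratios)
    return mean_dev > delta
```

With a positive advantage and ratio 1.5, the clipped surrogate caps the objective at `1.2 * A`, whereas the early-stopping variant would let the ratio move freely until the batch-level deviation trips the threshold.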
On the Global Optimum Convergence of Momentum-based Policy Gradient
Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has spurred interest in their global convergence properties.
Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences
This work investigates approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values, and shows that the reverse KL has stronger policy improvement guarantees, and that reducing the forward KL can result in a worse policy.
Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Reinforcement Learning
A new reinforcement learning algorithm called Short Horizon Policy Improvement (SHPI) is developed that approximates policy-induced drift in user behavior across sessions and can outperform state-of-the-art recommendation techniques like matrix factorization with offline proxy signals, bandits with myopic online proxies, and RL baselines with limited amounts of user interaction.
Stable Policy Optimization via Off-Policy Divergence Regularization
This paper revisits the theoretical foundations of TRPO and PPO and proposes a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distributions induced by consecutive policies to be close to one another.
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis, and proves fast rates of Õ(1/N), much like results in convex optimization.
Neural Policy Gradient Methods: Global Optimality and Rates of Convergence
This analysis establishes the first global optimality and convergence guarantees for neural policy gradient methods by relating the suboptimality of the stationary points to the representation power of neural actor and critic classes and proving the global optimality of all stationary points under mild regularity conditions.
Risk-Averse Trust Region Optimization for Reward-Volatility Reduction
A novel measure of risk is defined, which is called reward volatility, consisting of the variance of the rewards under the state-occupancy measure, and is shown to bound the return variance so that reducing the former also constrains the latter.
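A minimal sketch of the reward-volatility idea: the variance of per-step rewards, weighted by a normalized discounted occupancy over time steps. The weighting scheme below is an illustrative finite-horizon choice, not the paper's exact estimator:

```python
def reward_volatility(trajectories, gamma=0.99):
    # Variance of per-step rewards under (normalized) discounted
    # occupancy weights (1 - gamma) * gamma^t -- a sketch of the
    # "reward volatility" measure, distinct from return variance.
    weights, rewards = [], []
    for traj in trajectories:
        for t, r in enumerate(traj):
            weights.append((1.0 - gamma) * gamma ** t)
            rewards.append(r)
    z = sum(weights)
    mean = sum(w * r for w, r in zip(weights, rewards)) / z
    return sum(w * (r - mean) ** 2 for w, r in zip(weights, rewards)) / z
```

A policy whose rewards fluctuate step to step has high volatility even if its total return is stable, and the paper's bound says constraining this quantity also constrains the return variance.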
Policy Optimization Through Approximated Importance Sampling
This paper derives an alternative objective that obtains the value of the target policy by applying importance sampling (IS), and develops a practical algorithm that improves upon the state of the art in on-policy policy optimization on continuous control benchmarks.
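The importance-sampling idea behind such objectives can be illustrated with a per-sample ratio-weighted value estimate. This is a textbook IS sketch under behavior-policy samples, not the paper's derived objective:

```python
def is_value_estimate(rewards, target_probs, behavior_probs):
    # Off-policy estimate of the target policy's value from samples
    # drawn under a behavior policy b:
    #   E_b[ (pi(a|s) / b(a|s)) * R ]
    # where target_probs[i] = pi(a_i|s_i) and behavior_probs[i] = b(a_i|s_i).
    total = 0.0
    for reward, p_t, p_b in zip(rewards, target_probs, behavior_probs):
        total += (p_t / p_b) * reward
    return total / len(rewards)
```

When the target and behavior policies coincide, every ratio is 1 and the estimate reduces to the sample mean; as the policies diverge, the ratios inflate the variance, which is the difficulty that approximated-IS objectives aim to tame.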