Monte Carlo Tree Search for Policy Optimization

  title={Monte Carlo Tree Search for Policy Optimization},
  author={Xiaobai Ma and K. Driggs-Campbell and Zongzhang Zhang and Mykel J. Kochenderfer},
Gradient-based methods are often used for policy optimization in deep reinforcement learning, despite being vulnerable to local optima and saddle points. Although gradient-free methods (e.g., genetic algorithms or evolution strategies) help mitigate these issues, poor initialization and local optima are still concerns in highly nonconvex spaces. This paper presents a method for policy optimization based on Monte-Carlo tree search and gradient-free optimization. Our method, called Monte-Carlo… 
Local Search for Policy Iteration in Continuous Control
An algorithm for local, regularized, policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework and introduces a form of tree search for continuous action spaces.


Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective
Evolution-Guided Policy Gradient in Reinforcement Learning
Evolutionary Reinforcement Learning (ERL), a hybrid algorithm that leverages the population of an EA to provide diversified data to train an RL agent, and reinserts the RL agent into theEA population periodically to inject gradient information into the EA.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
This work explores the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients, and highlights several advantages of ES as a blackbox optimization technique.
Trust Region Policy Optimization
A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
Progressive Strategies for Monte-Carlo Tree Search
Two progressive strategies for MCTS are introduced, called progressive bias and progressive unpruning, which enable the use of relatively time-expensive heuristic knowledge without speed reduction.
Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning
It is shown that combining DNNs with novelty search, which was designed to encourage exploration on tasks with deceptive or sparse reward functions, can solve a high-dimensional problem on which reward-maximizing algorithms fail, and expands the sense of the scale at which GAs can operate.
High-Dimensional Continuous Control Using Generalized Advantage Estimation
This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
This work proposes to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature with trust region, which is the first scalable trust region natural gradient method for actor-critic methods.
Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari
It is demonstrated that even a very basic canonical ES algorithm can achieve the same or even better performance than traditional RL algorithms, and is likely to lead to new advances in the state-of-the-art for solving RL problems.
Safe mutations for deep and recurrent neural networks through output gradients
A family of safe mutation (SM) operators that facilitate exploration without dramatically altering network behavior or requiring additional interaction with the environment are proposed, which dramatically increases the ability of a simple genetic algorithm-based neuroevolution method to find solutions in high-dimensional domains that require deep and/or recurrent neural networks.