Corpus ID: 54448007

Relative Entropy Regularized Policy Iteration

@article{Abdolmaleki2018RelativeER,
  title={Relative Entropy Regularized Policy Iteration},
  author={Abbas Abdolmaleki and Jost Tobias Springenberg and Jonas Degrave and Steven Bohez and Yuval Tassa and Dan Belov and Nicolas Manfred Otto Heess and Martin A. Riedmiller},
  journal={ArXiv},
  year={2018},
  volume={abs/1812.02256}
}
We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways… 
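To make the three steps concrete, here is a minimal Python sketch of one iteration of such a procedure. The replay buffer, Q-function, and policy interfaces, the temperature eta, and the exponential reweighting are illustrative assumptions for the sketch, not the paper's exact implementation.

```python
# A minimal sketch of the three-step loop described above, assuming a generic
# replay buffer, a fittable Q-function, and a policy exposing sample / weighted-MLE
# methods; the exponential reweighting with temperature `eta` is an illustrative
# choice, not necessarily the paper's exact construction.
import numpy as np

def improvement_step(states, q_fn, policy, eta=1.0, n_samples=20):
    """ii) Estimate a local non-parametric policy: sample actions from the
    current policy and reweight them by their exponentiated action values."""
    actions = np.stack([policy.sample(states) for _ in range(n_samples)], axis=1)          # [B, N, dA]
    q_values = np.stack([q_fn(states, actions[:, i]) for i in range(n_samples)], axis=1)   # [B, N]
    weights = np.exp((q_values - q_values.max(axis=1, keepdims=True)) / eta)               # stabilized softmax
    weights /= weights.sum(axis=1, keepdims=True)
    return actions, weights

def train_iteration(replay, q_fn, policy):
    states = replay.sample_states()
    q_fn.fit(replay)                                            # i) policy evaluation (parametric Q)
    actions, weights = improvement_step(states, q_fn, policy)   # ii) policy improvement (non-parametric)
    policy.fit_weighted_mle(states, actions, weights)           # iii) generalization (parametric policy)
```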
Modified Actor-Critics
TLDR
This paper proposes to combine (any kind of) soft greediness with Modified Policy Iteration (MPI), instantiates this framework with the PPO soft greediness, and shows that it is competitive with the state-of-the-art off-policy algorithm SAC.
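For reference, standard MPI interleaves a greedy step with $m$ steps of partial evaluation; the cited paper replaces the greedy step with a soft one (e.g. PPO's update). A schematic statement of MPI itself, not in the paper's exact notation:

$$
\pi_{k+1} \in \mathcal{G}(Q_k), \qquad Q_{k+1} = (T_{\pi_{k+1}})^{m} Q_k,
$$

where $\mathcal{G}$ is the greedy operator and $T_{\pi}$ the Bellman operator; $m = 1$ recovers value iteration and $m \to \infty$ recovers policy iteration.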
Q-Learning for Continuous Actions with Cross-Entropy Guided Policies
TLDR
This work proposes a novel approach, called Cross-Entropy Guided Policies, or CGP, that aims to combine the stability and performance of iterative sampling policies with the low computational cost of a policy network.
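As a rough illustration of the kind of iterative sampling policy that CGP distills into a feed-forward network, here is a generic cross-entropy-method (CEM) search over actions under a learned Q-function; all names and hyperparameters are assumptions for the sketch, not taken from the paper.

```python
# Generic CEM search for a high-value action under a learned Q-function.
import numpy as np

def cem_action(q_fn, state, action_dim, iters=3, pop=64, elite_frac=0.1):
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample a population of candidate actions (assumed bounded in [-1, 1]).
        candidates = np.clip(mean + std * np.random.randn(pop, action_dim), -1.0, 1.0)
        scores = q_fn(np.repeat(state[None], pop, axis=0), candidates)   # assumed to return shape (pop,)
        elites = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # a policy network can then be trained to imitate these actions
```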
Zeroth-Order Supervised Policy Improvement
TLDR
It is proved that with a good function structure, the zeroth-order optimization strategy combining both local and global sampling can find the global minimum within a polynomial number of samples.
V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
TLDR
V-MPO is introduced, an on-policy adaptation of Maximum a Posteriori Policy Optimization that performs policy iteration based on a learned state-value function and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters.
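Schematically, V-MPO weights the highest-advantage samples with a softmax at temperature $\eta$ and fits the policy by weighted maximum likelihood under KL constraints; the sketch below is an approximate rendering, not the paper's exact notation:

$$
\psi(s,a) = \frac{\exp\!\big(A(s,a)/\eta\big)}{\sum_{(s',a') \in \tilde{\mathcal{D}}} \exp\!\big(A(s',a')/\eta\big)}, \qquad
\mathcal{L}_{\pi}(\theta) = -\sum_{(s,a) \in \tilde{\mathcal{D}}} \psi(s,a)\, \log \pi_{\theta}(a \mid s),
$$

where $\tilde{\mathcal{D}}$ is the half of the batch with the highest advantages, estimated from the learned state-value function.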
Constrained Variational Policy Optimization for Safe Reinforcement Learning
TLDR
A novel Expectation-Maximization approach that naturally incorporates constraints during policy learning, so that a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step).
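The E-step referred to here can be sketched as a constrained variational problem over a non-parametric distribution $q$; the form below is an approximate rendering under assumed notation ($Q_r$ for reward values, $Q_c$ for cost values), not the paper's exact statement:

$$
\max_{q}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim q}\!\left[ Q_r(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s,\, a \sim q}\!\left[ Q_c(s,a) \right] \le \epsilon_c, \qquad
\mathbb{E}_{s}\!\left[ \mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi_{\mathrm{old}}(\cdot \mid s)\big) \right] \le \epsilon,
$$

whose solution takes the closed form $q(a \mid s) \propto \pi_{\mathrm{old}}(a \mid s)\, \exp\!\big((Q_r(s,a) - \lambda Q_c(s,a))/\eta\big)$, with the multipliers $(\eta, \lambda)$ obtained from a convex dual problem.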
Revisiting Gaussian mixture critics in off-policy reinforcement learning: a sample-based approach
TLDR
This work revisits Gaussian mixture critics via a sample-based approach, a natural alternative that removes the need for distributional hyperparameters and achieves state-of-the-art performance on a variety of challenging tasks (e.g. the humanoid, dog, quadruped, and manipulator domains); an implementation is provided in the Acme agent repository.
Importance Weighted Policy Learning and Adaptation
TLDR
A complementary approach to meta reinforcement learning which is conceptually simple, general, modular and built on top of recent improvements in off-policy learning, inspired by ideas from the probabilistic inference literature.
Policy Search by Target Distribution Learning for Continuous Control
TLDR
The experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation
TLDR
This work proposes PPO-CMA, a proximal policy optimization approach that adaptively expands the exploration variance to speed up progress and is significantly less sensitive to the choice of hyperparameters, allowing one to use it in complex movement optimization tasks without requiring tedious tuning.
...

References

Showing 1-10 of 50 references
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective
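The surrogate in question is commonly instantiated as the clipped objective

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right],
\qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ the clipping range.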
Safe and Efficient Off-Policy Reinforcement Learning
TLDR
A novel algorithm, Retrace ($\lambda$), is derived, believed to be the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
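For reference, the Retrace($\lambda$) operator corrects off-policy returns with truncated importance ratios:

$$
\mathcal{R} Q(x, a) = Q(x, a) + \mathbb{E}_{\mu} \left[ \sum_{t \ge 0} \gamma^{t} \left( \prod_{s=1}^{t} c_s \right) \big( r_t + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \right],
\qquad c_s = \lambda \min\!\left( 1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)} \right),
$$

with trajectories drawn from the behaviour policy $\mu$ starting at $(x_0, a_0) = (x, a)$ and $\mathbb{E}_{\pi} Q(x, \cdot) = \sum_{b} \pi(b \mid x)\, Q(x, b)$.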
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TLDR
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
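The maximum entropy objective underlying this framework augments the expected return with a policy entropy bonus:

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],
$$

where the temperature $\alpha$ trades off reward against entropy.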
Bridging the Gap Between Value and Policy Based Reinforcement Learning
TLDR
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
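The soft consistency error penalizes violations of the multi-step path consistency satisfied by the optimal entropy-regularized value function and policy; schematically (notation approximate),

$$
C(s_{t:t+d}) = -V_{\phi}(s_t) + \gamma^{d} V_{\phi}(s_{t+d}) + \sum_{j=0}^{d-1} \gamma^{j} \big[ r(s_{t+j}, a_{t+j}) - \tau \log \pi_{\theta}(a_{t+j} \mid s_{t+j}) \big],
$$

and PCL minimizes $\tfrac{1}{2} C^2$ over both on-policy and off-policy trajectories.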
Policy Gradient Methods for Reinforcement Learning with Function Approximation
TLDR
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
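The result rests on the policy gradient theorem,

$$
\nabla_{\theta} J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a),
$$

where $d^{\pi}$ is the (discounted) state distribution under $\pi$; convergence to a locally optimal policy is shown when the action-value approximator is compatible with the policy parameterization.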
Variational Inference for Policy Search in changing situations
TLDR
Variational Inference for Policy Search (VIP) has several interesting properties and matches the performance of state-of-the-art methods while being applicable to learning in multiple situations simultaneously.
Maximum a Posteriori Policy Optimisation
TLDR
This work introduces a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective and develops two off-policy algorithms that are competitive with the state-of-the-art in deep reinforcement learning.
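Schematically, the coordinate ascent alternates an E-step, which computes a non-parametric improved policy under a KL bound, with an M-step, which projects it back onto the parametric class (notation approximate):

$$
\text{E-step:}\quad q(a \mid s) \propto \pi_{\theta_{\mathrm{old}}}(a \mid s)\, \exp\!\big( Q(s,a)/\eta \big), \quad \text{s.t. } \mathbb{E}_{s}\big[\mathrm{KL}\big(q \,\|\, \pi_{\theta_{\mathrm{old}}}\big)\big] \le \epsilon,
$$
$$
\text{M-step:}\quad \theta \leftarrow \arg\max_{\theta}\; \mathbb{E}_{s}\, \mathbb{E}_{a \sim q}\big[ \log \pi_{\theta}(a \mid s) \big], \quad \text{s.t. } \mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_{\theta}\big)\big] \le \epsilon_{m}.
$$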
Regularized Policy Iteration
TLDR
This paper proposes two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD).
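A hedged sketch of the regularized evaluation step for the BRM variant (the LSTD variant regularizes the projected fixed-point problem analogously); the notation here is illustrative rather than the paper's:

$$
\hat{Q} = \arg\min_{Q \in \mathcal{F}} \; \big\| Q - \hat{T}^{\pi} Q \big\|_{n}^{2} + \lambda\, \|Q\|_{\mathcal{F}}^{2},
$$

where $\hat{T}^{\pi}$ is the empirical Bellman operator and $\|\cdot\|_{\mathcal{F}}$ the regularizer's norm (e.g. an RKHS norm or an $\ell_2$ norm on the weights).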
Expected Policy Gradients
TLDR
A new general policy gradient theorem is established, of which the stochastic and deterministic policy gradient theorems are special cases, and it is proved that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead.
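The variance reduction comes from integrating over the action space analytically (or by quadrature) instead of relying on a single sampled action; schematically,

$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}} \left[ \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a \right],
$$

where for Gaussian policies (e.g. with critics quadratic in the action) the inner integral admits a closed form.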
Guided Policy Search via Approximate Mirror Descent
TLDR
A new guided policy search algorithm is derived that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and it is shown that in the more general nonlinear setting, the error in the projection step can be bounded.
...