• Corpus ID: 54448007

# Relative Entropy Regularized Policy Iteration

@article{Abdolmaleki2018RelativeER,
title={Relative Entropy Regularized Policy Iteration},
author={Abbas Abdolmaleki and Jost Tobias Springenberg and Jonas Degrave and Steven Bohez and Yuval Tassa and Dan Belov and Nicolas Manfred Otto Heess and Martin A. Riedmiller},
journal={ArXiv},
year={2018},
volume={abs/1812.02256}
}
• Published 5 December 2018
• Computer Science
• ArXiv
We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways…
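The three-step loop described in the abstract can be sketched on a toy problem. Everything below is an illustrative assumption, not the paper's implementation: the KL temperature `eta` is fixed rather than optimized, the "critic" is evaluated exactly on a discretized 1-D action grid, and the parametric policy is a single Gaussian fit by weighted moment matching.

```python
# Sketch of relative-entropy-regularized policy iteration on a 1-D bandit.
# All names and constants (eta, the quadratic reward, the action grid)
# are illustrative assumptions.
import numpy as np

actions = np.linspace(-2.0, 2.0, 41)      # discretized action space
mu, sigma = 0.0, 1.0                      # parametric Gaussian policy
eta = 0.5                                 # KL temperature (fixed here)

def reward(a):
    return -(a - 1.0) ** 2                # toy objective, maximized at a = 1

for _ in range(20):
    # i) policy evaluation: rewards are deterministic here, so the
    #    "action-value function" is evaluated directly on the grid.
    q = reward(actions)

    # ii) policy improvement: non-parametric policy proportional to
    #     pi(a) * exp(Q(a) / eta) -- greedy improvement softened by a
    #     relative-entropy penalty against the current policy.
    log_pi = -0.5 * ((actions - mu) / sigma) ** 2
    w = np.exp(log_pi + q / eta)
    w /= w.sum()

    # iii) generalization: fit the parametric Gaussian to the weighted
    #      samples (weighted maximum likelihood / moment matching).
    mu = float(np.sum(w * actions))
    sigma = float(np.sqrt(np.sum(w * (actions - mu) ** 2)) + 1e-3)

# the Gaussian policy concentrates near the optimum a = 1
```

Note the role of `eta`: a small temperature makes step ii) nearly greedy with respect to Q, while a large one keeps the improved policy close to the current one.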

## Citations

Modified Actor-Critics
• Computer Science
AAMAS
• 2020
This paper proposes to combine (any kind of) soft greediness with Modified Policy Iteration (MPI), instantiates this framework with the PPO soft greediness, and shows that it is competitive with the state-of-the-art off-policy algorithm SAC.
Q-Learning for Continuous Actions with Cross-Entropy Guided Policies
• Computer Science
ArXiv
• 2019
This work proposes a novel approach, called Cross-Entropy Guided Policies, or CGP, that aims to combine the stability and performance of iterative sampling policies with the low computational cost of a policy network.
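The "iterative sampling policy" that CGP builds on can be sketched with the cross-entropy method (CEM): repeatedly sample actions from a Gaussian, keep the elites under a learned Q-function, and refit. The toy quadratic critic and all hyperparameters below are illustrative assumptions, not CGP's configuration.

```python
# Minimal cross-entropy-method action search against a critic q_fn.
# The critic, population size, and elite fraction are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def cem_action(q_fn, dim, iters=10, pop=64, n_elite=8):
    """Search for an action maximizing q_fn by iterated elite refitting."""
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = q_fn(samples)
        elite = samples[np.argsort(scores)[-n_elite:]]   # top-scoring actions
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3                   # floor avoids collapse
    return mean

# toy critic with a known maximizer at a = 0.5 in every dimension
best = cem_action(lambda a: -np.sum((a - 0.5) ** 2, axis=1), dim=2)
```

The cost CGP targets is visible here: every action selection runs `iters * pop` critic evaluations, which is what distilling the sampler into a policy network avoids.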
Zeroth-Order Supervised Policy Improvement
• Hao Sun
• Computer Science
ArXiv
• 2020
It is proved that with a good function structure, the zeroth-order optimization strategy combining both local and global samplings can find the global minima within a polynomial number of samples.
V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
• Computer Science
ICLR
• 2020
V-MPO is introduced, an on-policy adaptation of Maximum a Posteriori Policy Optimization that performs policy iteration based on a learned state-value function and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters.
Constrained Variational Policy Optimization for Safe Reinforcement Learning
• Computer Science
ICML
• 2022
A novel Expectation-Maximization approach naturally incorporates constraints during policy learning, so that a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step).
Revisiting Gaussian mixture critics in off-policy reinforcement learning: a sample-based approach
• Computer Science
ArXiv
• 2022
A sample-based alternative that removes the need for distributional hyperparameters is revisited; it achieves state-of-the-art performance on a variety of challenging tasks (e.g. the humanoid, dog, quadruped, and manipulator domains), and an implementation is provided in the Acme agent repository.
Importance Weighted Policy Learning and Adaption
• Computer Science
ArXiv
• 2020
A complementary approach to meta-reinforcement learning that is conceptually simple, general, modular, and built on top of recent improvements in off-policy learning, inspired by ideas from the probabilistic inference literature.
Policy Search by Target Distribution Learning for Continuous Control
• Computer Science
AAAI
• 2020
The experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation
• Computer Science
2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP)
• 2020
This work proposes PPO-CMA, a proximal policy optimization approach that adaptively expands the exploration variance to speed up progress and is significantly less sensitive to the choice of hyperparameters, allowing one to use it in complex movement optimization tasks without requiring tedious tuning.

## References

Showing 1–10 of 50 references
Proximal Policy Optimization Algorithms
• Computer Science
ArXiv
• 2017
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
Safe and Efficient Off-Policy Reinforcement Learning
• Computer Science
NIPS
• 2016
A novel algorithm, Retrace($\lambda$), is derived; it is believed to be the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration).
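The truncated importance weights at the heart of Retrace($\lambda$), $c_i = \lambda \min(1, \pi(a_i|x_i)/\mu(a_i|x_i))$, can be sketched as a backward recursion over a trajectory. The array layout and function signature below are illustrative assumptions, not the paper's code.

```python
# Sketch of Retrace(lambda) targets along one off-policy trajectory.
import numpy as np

def retrace_targets(q, exp_q_next, rewards, pi, mu, gamma=0.99, lam=0.95):
    """q[t] = Q(x_t, a_t); exp_q_next[t] = E_pi Q(x_{t+1}, .);
    pi, mu = target / behaviour probabilities of the actions taken."""
    c = lam * np.minimum(1.0, pi / mu)          # truncated importance weights
    delta = rewards + gamma * exp_q_next - q    # one-step TD errors
    T = len(rewards)
    targets, acc = np.copy(q), 0.0
    for t in reversed(range(T)):
        # acc accumulates sum_{s>=t} gamma^{s-t} (prod_{i=t+1}^{s} c_i) delta_s
        nxt = c[t + 1] if t + 1 < T else 0.0
        acc = delta[t] + gamma * nxt * acc
        targets[t] += acc
    return targets

# on-policy (pi == mu) the truncation is inactive and the targets reduce
# to ordinary lambda-style multi-step returns
tgt = retrace_targets(q=np.zeros(3), exp_q_next=np.zeros(3),
                      rewards=np.ones(3),
                      pi=np.full(3, 0.5), mu=np.full(3, 0.5))
```

Because each $c_i \le \lambda \le 1$, the product of weights can only shrink, which is what keeps the multi-step correction stable far off-policy.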
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
• Computer Science
ICML
• 2018
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
Bridging the Gap Between Value and Policy Based Reinforcement Learning
• Computer Science
NIPS
• 2017
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
• Computer Science
NIPS
• 1999
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Variational Inference for Policy Search in changing situations
Variational Inference for Policy Search (VIP) has several interesting properties and meets the performance of state-of-the-art methods while being applicable to simultaneously learning in multiple situations.
Maximum a Posteriori Policy Optimisation
• Computer Science
ICLR
• 2018
This work introduces a new algorithm for reinforcement learning called Maximum a posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective, and develops two off-policy algorithms that are competitive with the state-of-the-art in deep reinforcement learning.
Regularized Policy Iteration
• Computer Science, Mathematics
NIPS
• 2008
This paper proposes two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD).