Corpus ID: 202889322

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

@article{Song2020VMPOOM,
  title={V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control},
  author={H. Francis Song and Abbas Abdolmaleki and Jost Tobias Springenberg and Aidan Clark and Hubert Soyer and Jack W. Rae and Seb Noury and Arun Ahuja and Siqi Liu and Dhruva Tirumala and Nicolas Manfred Otto Heess and Dan Belov and Martin A. Riedmiller and Matthew M. Botvinick},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.12238}
}
Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy…
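The abstract is truncated here, but since V-MPO is described as an on-policy adaptation of MPO that works with a learned state-value function, its core policy update can be pictured as a weighted maximum-likelihood fit toward high-advantage actions. The sketch below is a minimal illustration of that weighting step only, not the paper's full algorithm: the temperature is fixed rather than learned through the paper's dual losses, the top-half filtering is stated here as an assumption, and all names are illustrative.

```python
import numpy as np

def advantage_weighted_policy_loss(advantages, log_probs, temperature=1.0, top_half=True):
    """Weighted negative log-likelihood for one on-policy batch (illustrative only).

    advantages : (N,) advantage estimates, e.g. return minus a learned state value V(s)
    log_probs  : (N,) log pi(a | s) for the actions actually taken, under the current policy
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)

    if top_half:
        # Keep only samples with above-median advantage (assumed "better half" filter).
        keep = advantages >= np.median(advantages)
        advantages, log_probs = advantages[keep], log_probs[keep]

    # Nonparametric target weights: a softmax over advantages at a fixed temperature.
    z = advantages / temperature
    z = z - z.max()                      # for numerical stability
    weights = np.exp(z) / np.exp(z).sum()

    # Weighted maximum likelihood: pull the policy toward high-advantage actions.
    return -np.sum(weights * log_probs)

# Dummy usage with random numbers standing in for a real batch.
rng = np.random.default_rng(0)
print(advantage_weighted_policy_loss(rng.normal(size=8), -np.abs(rng.normal(size=8))))
```

In the full method the temperature and a KL bound on the policy update are themselves handled by Lagrangian terms optimized alongside this loss; the sketch omits both.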
Citations

Towards an Understanding of Default Policies in Multitask Policy Optimization
This work formally links the quality of the default policy to its effect on optimization and derives a principled RPO algorithm for multitask learning with strong performance guarantees.
What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
This work implements >50 such "choices" in a unified on-policy deep actor-critic framework, investigates their impact in a large-scale empirical study, and provides insights and practical recommendations for training on-policy deep actor-critic RL agents.
Optimization Issues in KL-Constrained Approximate Policy Iteration
This work compares the use of KL divergence as a constraint versus as a regularizer, points out several optimization issues with the widely used constrained approach, and shows that regularization can improve the optimization landscape of the original objective.
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
This work implements >50 such "choices" in a unified on-policy RL framework, investigates their impact in a large-scale empirical study, and provides insights and practical recommendations for on-policy training of RL agents.
An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning
Proposes an entropy-regularization-free mechanism designed for policy-based methods that achieves Closed-form Diversity, Objective-invariant Exploration, and Adaptive Trade-off.
Local Search for Policy Iteration in Continuous Control
An algorithm for local, regularized policy improvement in reinforcement learning (RL) that allows model-based and model-free variants to be formulated in a single framework, and introduces a form of tree search for continuous action spaces.
Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning
This work focuses on a series of off-policy inference-based actor-critic algorithms – MPO, AWR, and SAC – to decouple their algorithmic innovations and implementation decisions, and presents unified derivations through a single control-as-inference objective.
A Distributional View on Multi-Objective Policy Optimization
This paper proposes a novel algorithm for multi-objective reinforcement learning that enables setting desired preferences for objectives in a scale-invariant way, and uses supervised learning to fit a parametric policy to a combination of these distributions.
Revisiting Design Choices in Proximal Policy Optimization
This work revisits standard PPO practices outside the regime of current benchmarks, exposes three failure modes, explains why standard design choices are problematic in these cases, and shows that alternative choices of surrogate objectives and policy parameterizations can prevent these failure modes.
Phasic Policy Gradient
Phasic Policy Gradient, a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases, significantly improves sample efficiency on the challenging Procgen Benchmark.

References

Showing 1-10 of 41 references
Relative Entropy Regularized Policy Iteration
An off-policy actor-critic algorithm for reinforcement learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function, and can be seen either as an extension of the Maximum a Posteriori Policy Optimisation (MPO) algorithm or as an addition to a policy iteration scheme.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
The experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks, high-dimensional hand manipulation, and synthetic tasks, and that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
Maximum a Posteriori Policy Optimisation
This work introduces a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative entropy objective, and develops two off-policy algorithms that are competitive with the state of the art in deep reinforcement learning.
Supervised Policy Update for Deep Reinforcement Learning
This work proposes a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning, which formulates and solves a constrained optimization problem in the non-parameterized proximal policy space, and converts the ideal policy to a parameterized policy, from which it draws new samples.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective…
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end, directly from raw pixel inputs.
Fitted Q-iteration by Advantage Weighted Regression
It is shown that, by using soft-greedy action selection, the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression, yielding a new, computationally efficient FQI algorithm that can deal even with high-dimensional action spaces.
Relative Entropy Policy Search
The Relative Entropy Policy Search (REPS) method is suggested; it differs significantly from previous policy gradient approaches, yields an exact update step, and works well on typical reinforcement learning benchmark problems.