Corpus ID: 246411597

Constrained Variational Policy Optimization for Safe Reinforcement Learning

@article{Liu2022ConstrainedVP,
  title={Constrained Variational Policy Optimization for Safe Reinforcement Learning},
  author={Zuxin Liu and Zhepeng Cen and Vladislav Isenbaev and Wei Liu and Zhiwei Steven Wu and Bo Li and Ding Zhao},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.11927}
}
Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying them to safety-critical applications. Previous primal-dual style approaches suffer from instability issues and lack optimality guarantees. This paper overcomes these issues from the perspective of probabilistic inference. We introduce a novel Expectation-Maximization approach to naturally incorporate constraints during policy learning: 1) a provable optimal non-parametric variational…
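For context, a minimal sketch of the constrained variational E-step the abstract alludes to, written in MPO-style notation; the symbols Q_r, Q_c, epsilon_c, eta, and lambda are assumed here and are not necessarily the paper's:

    \max_{q}\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim q(\cdot|s)}\big[Q_r(s,a)\big]
    \quad \text{s.t.}\quad
    \mathbb{E}_{s\sim\mathcal{D},\,a\sim q(\cdot|s)}\big[Q_c(s,a)\big] \le \epsilon_c,
    \qquad
    \mathbb{E}_{s\sim\mathcal{D}}\big[\mathrm{KL}\big(q(\cdot|s)\,\|\,\pi_{\mathrm{old}}(\cdot|s)\big)\big] \le \epsilon_{\mathrm{KL}},

    q^*(a|s) \;\propto\; \pi_{\mathrm{old}}(a|s)\,
    \exp\!\Big(\tfrac{Q_r(s,a)-\lambda\,Q_c(s,a)}{\eta}\Big),

where eta and lambda >= 0 are dual variables obtained from the convex dual of the constrained problem; a parametric policy is then fit to q* in a supervised M-step.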
Enhancing Safe Exploration Using Safety State Augmentation
TLDR
The approach is called Simmer (Safe policy IMproveMEnt for RL) to reflect the careful nature of these schedules, and it is suggested that simmering a safe algorithm can improve safety during training in both settings.
On the Robustness of Safe Reinforcement Learning under Observational Perturbations
TLDR
This work sheds light on the inherent connection between observational robustness and safety in RL and proposes a more effective adversarial training framework, serving as pioneering work for future safe RL studies.
On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions
TLDR
Applications of the theorems are discussed, including explaining counterintuitive phenomena of flow-based models, deriving a deep anomaly detection algorithm, and extending a one-step robustness guarantee to multiple steps in safe reinforcement learning.

References

Showing 1-10 of 40 references
A Lyapunov-based Approach to Safe Reinforcement Learning
TLDR
This work defines and presents a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints.
Safe Reinforcement Learning Using Advantage-Based Intervention
TLDR
This work proposes a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training and optimizes the agent’s policy using off-the-shelf RL algorithms designed for unconstrained MDPs.
Constrained Policy Optimization
TLDR
Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration, and allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training.
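For reference, a sketch of the constrained trust-region update CPO solves, in standard notation; A_r and A_c denote reward and cost advantages, d the cost limit, and delta the trust-region radius (symbols assumed here, not quoted from the paper):

    \pi_{k+1} \;=\; \arg\max_{\pi}\;
    \mathbb{E}_{s,a\sim\pi_k}\!\Big[\tfrac{\pi(a|s)}{\pi_k(a|s)}\,A^{\pi_k}_{r}(s,a)\Big]
    \quad \text{s.t.}\quad
    J_c(\pi_k) + \tfrac{1}{1-\gamma}\,\mathbb{E}_{s,a\sim\pi_k}\!\Big[\tfrac{\pi(a|s)}{\pi_k(a|s)}\,A^{\pi_k}_{c}(s,a)\Big] \le d,
    \qquad
    \bar{D}_{\mathrm{KL}}(\pi\,\|\,\pi_k) \le \delta.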
Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning
TLDR
This paper proposes a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in primal space with on-policy likelihood ratio gradient.
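As a point of reference, the plain primal-dual Lagrangian scheme that methods like APDO refine: the policy ascends the Lagrangian L(pi, lambda) = J_r(pi) - lambda * (J_c(pi) - d) in primal space, while the multiplier follows projected gradient ascent on the dual. The Python sketch below is an illustration under these assumptions, not the paper's algorithm:

    def dual_update(lam: float, cost_return: float, cost_limit: float,
                    lr: float = 0.05) -> float:
        """One projected dual-ascent step on the Lagrange multiplier."""
        return max(0.0, lam + lr * (cost_return - cost_limit))

    # Example: the multiplier grows while the estimated cost exceeds the limit,
    # then stops rising once the constraint is satisfied.
    lam = 0.0
    for cost_return in [30.0, 28.0, 26.0, 24.0]:  # hypothetical cost estimates
        lam = dual_update(lam, cost_return, cost_limit=25.0)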
Responsive Safety in Reinforcement Learning by PID Lagrangian Methods
TLDR
This work proposes a novel Lagrange multiplier update method that utilizes derivatives of the constraint function, and introduces a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost.
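A minimal sketch of a PID-controlled Lagrange multiplier in the spirit of this method: the multiplier responds to the current cost violation (P), its accumulation (I), and its rate of change (D). The gains and projection details below are illustrative assumptions, not the paper's exact recipe:

    class PIDLagrangian:
        def __init__(self, kp=0.1, ki=0.01, kd=0.05, cost_limit=25.0):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.cost_limit = cost_limit
            self.integral = 0.0
            self.prev_cost = 0.0

        def update(self, cost_return: float) -> float:
            """Return the new multiplier given the latest estimated cost return."""
            error = cost_return - self.cost_limit
            self.integral = max(0.0, self.integral + error)      # project integral to stay nonnegative
            derivative = max(0.0, cost_return - self.prev_cost)  # react only to rising cost
            self.prev_cost = cost_return
            return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)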
First Order Constrained Optimization in Policy Space
TLDR
This work proposes a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
Relative Entropy Regularized Policy Iteration
TLDR
An off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function, and can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) or as an addition to a policy iteration scheme.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TLDR
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
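For reference, the maximum-entropy objective that soft actor-critic optimizes, in its standard form; alpha is the temperature weighting the entropy bonus:

    J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\, r(s_t,a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot\,|\,s_t)\big) \Big].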
V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
TLDR
V-MPO is introduced, an on-policy adaptation of Maximum a Posteriori Policy Optimization that performs policy iteration based on a learned state-value function and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters.
End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks
TLDR
This work proposes a controller architecture that combines a model-free RL-based controller with model-based controllers utilizing control barrier functions (CBFs) and on-line learning of the unknown system dynamics, in order to ensure safety during learning.
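For context, the standard continuous-time control barrier function condition that such safety layers enforce; this is the general statement, and the paper's exact formulation and the program used to filter the RL action may differ:

    \text{For a safe set } \mathcal{C}=\{x: h(x)\ge 0\} \text{ and dynamics } \dot{x}=f(x)+g(x)u:
    \qquad
    \sup_{u}\big[\, L_f h(x) + L_g h(x)\,u \,\big] \;\ge\; -\alpha\big(h(x)\big),

for some extended class-K function alpha; actions proposed by the RL controller are then minimally adjusted, typically via a quadratic program, so that this inequality holds.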