On the Robustness of Safe Reinforcement Learning under Observational Perturbations

Zuxin Liu, Zijian Guo, Zhepeng Cen, Huan Zhang, Jie Tan, Bo Li, Ding Zhao
Safe reinforcement learning (RL) trains a policy to maximize task reward while satisfying safety constraints. While prior work focuses on performance optimality, we find that the optimal solutions of many safe RL problems are not robust or safe against carefully designed observational perturbations. We formally analyze the unique properties of designing effective state-adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks… 
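The safe RL problem described in the abstract is commonly formalized as a constrained Markov decision process (CMDP); a standard formulation (the notation below is a common convention, not taken from this page) is:

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\Big] \le \kappa
```

Here $r$ is the task reward, $c$ the safety cost, $\gamma$ the discount factor, and $\kappa$ the cost threshold. An observational attacker replaces $s_t$ with a perturbed observation in the policy's input, while the environment dynamics still evolve from the true state $s_t$.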

SafeBench: A Benchmarking Platform for Safety Evaluation of Autonomous Vehicles
This paper considers 8 safety-critical testing scenarios following the National Highway Traffic Safety Administration (NHTSA), develops 4 scenario generation algorithms with 10 variations for each scenario, and implements 4 deep reinforcement learning-based AD algorithms with 4 types of input to enable fair comparisons on SafeBench.
An Introduction to Reinforcement Learning
Responsive Safety in Reinforcement Learning by PID Lagrangian Methods
This work proposes a novel Lagrange multiplier update method that utilizes derivatives of the constraint function, and introduces a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost.
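The PID Lagrangian idea above can be sketched in a few lines: the constraint violation acts as the error signal of a PID controller that sets the Lagrange multiplier. The gains, cost threshold, and toy cost sequence below are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Minimal sketch of a PID-controlled Lagrange multiplier update
# (after the "Responsive Safety" idea). Gains and costs are illustrative.

def pid_lambda_update(cost, limit, state, kp=0.1, ki=0.01, kd=0.05):
    """Update the Lagrange multiplier from the constraint violation.

    cost:  current episodic cost estimate J_C(pi)
    limit: constraint threshold d
    state: dict carrying the integral term and the previous error
    """
    error = cost - limit                                 # violation e_t
    state["integral"] = max(0.0, state["integral"] + ki * error)
    derivative = max(0.0, error - state["prev_error"])   # penalize growth only
    state["prev_error"] = error
    # P, I, and D terms combine into a non-negative multiplier
    return max(0.0, kp * error + state["integral"] + kd * derivative)

state = {"integral": 0.0, "prev_error": 0.0}
lam = 0.0
for cost in [30.0, 28.0, 26.0, 24.0]:   # toy costs converging to limit 25
    lam = pid_lambda_update(cost, limit=25.0, state=state)
```

The multiplier then weights the cost in the policy objective, e.g. maximizing reward minus `lam` times cost; because the derivative term reacts to a rising cost, the controller damps the oscillations that a plain gradient-ascent multiplier update exhibits.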
Constrained Policy Optimization
Constrained Policy Optimization (CPO) is proposed: the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees of near-constraint satisfaction at each iteration, which allows training neural network policies for high-dimensional control while making guarantees about policy behavior throughout training.
Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations
The state-adversarial Markov decision process (SA-MDP) is proposed, and a theoretically principled policy regularization is developed that can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and deep Q-networks (DQN), for both discrete and continuous action control problems.
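The threat model in this line of work, as in the abstract above, perturbs only what the agent observes, not the true state. A minimal evaluation sketch under random l-infinity-bounded noise (the gym-style `env`/`policy` interfaces are assumptions for illustration):

```python
import numpy as np

def rollout_under_obs_attack(env, policy, epsilon=0.1, steps=200, seed=0):
    """Roll out `policy` while an attacker adds l-inf bounded noise to
    observations. The environment dynamics see the TRUE state; only the
    policy's input is perturbed. Interfaces are gym-style assumptions."""
    rng = np.random.default_rng(seed)
    obs, _ = env.reset(seed=seed)
    total_reward = 0.0
    for _ in range(steps):
        delta = rng.uniform(-epsilon, epsilon, size=np.shape(obs))
        action = policy(obs + delta)      # policy acts on perturbed obs
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

Random noise is a weak attacker; the point of the work above is that gradient-based attackers searching inside the same epsilon-ball degrade reward, and in the safe RL setting also safety, far more than this baseline.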
Reward Constrained Policy Optimization
This work presents a novel multi-timescale approach for constrained policy optimization, called Reward Constrained Policy Optimization (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one.
Addressing Function Approximation Error in Actor-Critic Methods
This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
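The clipped double-critic target described above fits in a few lines; the critic and policy functions below are stand-ins for the paper's networks, not its actual architecture.

```python
def td3_target(reward, next_state, q1_target, q2_target, policy_target,
               gamma=0.99, done=False):
    """Compute a clipped double-Q target value: take the minimum of two
    target critics' estimates to limit overestimation bias."""
    a_next = policy_target(next_state)
    q_min = min(q1_target(next_state, a_next), q2_target(next_state, a_next))
    # Bootstrap only if the episode continues
    return reward + (0.0 if done else gamma) * q_min
```

Because each critic's positive estimation error is unlikely to appear in both networks at the same state-action pair, the minimum biases the target low rather than high, which is the safer direction for bootstrapped value learning.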
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
Robust Reinforcement Learning: A Review of Foundations and Recent Advances
This survey covers the fundamental concepts underlying approaches to robust reinforcement learning and their recent advances, and addresses the connection of robustness to risk-based and entropy-regularized RL formulations.
Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training
This paper develops the Stackelberg Policy Gradient algorithm, which generates challenging yet solvable adversarial environments that benefit RL agents' robust learning, and demonstrates better training stability and robustness under different testing conditions in single-agent robotics control and multi-agent highway-merging tasks.
Constrained Variational Policy Optimization for Safe Reinforcement Learning
A novel Expectation-Maximization approach is proposed to naturally incorporate constraints during policy learning, so that a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step).