Corpus ID: 225075683

Conservative Safety Critics for Exploration

Authors: Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg
Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every…
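The action-filtering idea in the abstract can be sketched as follows. This is a minimal, hypothetical example: `q_risk` stands in for a learned conservative safety critic that estimates failure probability, and rejection sampling against a threshold `eps` is an illustrative gating scheme, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_risk(state, action):
    """Stand-in for a learned conservative safety critic: estimated
    probability that taking `action` in `state` leads to failure.
    Toy rule: large-magnitude actions are flagged as risky."""
    return float(np.clip(np.abs(action).max(), 0.0, 1.0))

def safe_action(state, propose, eps=0.3, max_tries=100):
    """Rejection-sample policy proposals until the critic's (conservative)
    failure estimate falls below the safety threshold eps."""
    for _ in range(max_tries):
        a = propose(state)
        if q_risk(state, a) <= eps:
            return a
    return np.zeros_like(propose(state))   # conservative no-op fallback

policy = lambda s: rng.uniform(-1, 1, size=2)
a = safe_action(np.zeros(2), policy)
assert q_risk(np.zeros(2), a) <= 0.3
```

Because the critic is trained to overestimate risk, gating on it bounds the true failure probability during exploration.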

C-Learning: Horizon-Aware Cumulative Accessibility Estimation

This work introduces the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon, and shows that these functions obey a recurrence relation, which enables learning from offline interactions.
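The recurrence mentioned in the summary can be illustrated on a toy deterministic chain: the goal is accessible from s within h steps iff some action leads to a state from which it is accessible within h-1 steps. The environment, horizon, and tabular layout here are illustrative assumptions.

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions move left/right (clipped).
N = 5
step = lambda s, a: int(np.clip(s + a, 0, N - 1))
goal = 4

H = 6
C = np.zeros((H + 1, N))      # C[h, s] = 1 iff goal reachable from s in <= h steps
C[0, goal] = 1.0
for h in range(1, H + 1):
    for s in range(N):
        # Recurrence: accessible within h steps iff some action reaches a
        # state from which the goal is accessible within h-1 steps.
        C[h, s] = max(C[0, s], max(C[h - 1, step(s, a)] for a in (-1, +1)))

assert C[4, 0] == 1.0   # state 0 reaches goal 4 in 4 steps
assert C[3, 0] == 0.0   # but not in 3
```

The same recurrence, written with expectations over a learned critic, is what enables learning these functions from offline interactions.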

Guiding Safe Exploration with Weakest Preconditions

In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel

Safety-aware Policy Optimisation for Autonomous Racing

This work injects HJ reachability theory into the constrained Markov decision process (CMDP) framework, as a control-theoretical approach for safety analysis via model-free updates on state-action pairs, and demonstrates that the HJ safety value can be learned directly on vision context, the highest-dimensional problem studied via the method to-date.

A Review of Safe Reinforcement Learning: Methods, Theory and Applications

A review of the progress of safe RL from the perspectives of methods, theory, and applications, together with the open problems, coined "2H3W", that are crucial for deploying safe RL in real-world applications.

Exploiting Reward Shifting in Value-Based Deep RL

The key insight is that a positive reward shift leads to conservative exploitation while a negative reward shift leads to curiosity-driven exploration; conservative value estimation improves offline RL, and optimistic value estimation improves exploration in online RL.
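The mechanism itself is a one-line transformation, sketched below. The framing in the comment (how a constant shift interacts with a zero-initialised critic) follows the summary above; the function name is illustrative.

```python
def shift_rewards(rewards, b):
    """Uniformly shift every reward by a constant b. Relative to a
    zero-initialised value function, b < 0 makes unvisited regions look
    optimistic (encouraging exploration), while b > 0 makes them look
    pessimistic (encouraging conservative exploitation)."""
    return [r + b for r in rewards]

assert shift_rewards([1.0, 0.0, -1.0], -0.5) == [0.5, -0.5, -1.5]
```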

Do Androids Dream of Electric Fences? Safety-Aware Reinforcement Learning with Latent Shielding

This work presents a novel approach to safety-aware deep reinforcement learning in high-dimensional environments called latent shielding, which leverages internal representations of the environment learnt by model-based agents to "imagine" future trajectories and avoid those deemed unsafe.

Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning

This article provides a concise but holistic review of the recent advances made in using machine learning to achieve safe decision-making under uncertainties, with a focus on unifying the language and frameworks used in control theory and reinforcement learning research.

Safe Autonomous Racing via Approximate Reachability on Ego-vision

This work proposes to incorporate Hamilton-Jacobi (HJ) reachability theory, a safety verification method for general non-linear systems, into the constrained Markov decision process (CMDP) framework, and demonstrates that with neural approximation, the HJ safety value can be learned directly on vision context—the highest-dimensional problem studied via the method to date.
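The reachability-style safety value that both racing papers learn can be sketched in tabular form with a discounted safety Bellman backup, shown here on an assumed toy 1-D system with a signed-distance safety margin l(s) (negative inside the failure set); the dynamics and discount are illustrative.

```python
import numpy as np

# Toy 1-D grid; l(s) = margin to the failure set (negative = already failed).
xs = np.linspace(-2, 2, 41)
l = 1.0 - np.abs(xs)              # unsafe for |x| > 1
actions = (-1, 0, +1)             # move one cell left / stay / right
step = lambda i, a: int(np.clip(i + a, 0, len(xs) - 1))

gamma, V = 0.9, l.copy()
for _ in range(200):
    # Discounted safety Bellman backup:
    #   V(s) = (1 - gamma) * l(s) + gamma * min( l(s), max_a V(s') )
    best_next = np.array([max(V[step(i, a)] for a in actions)
                          for i in range(len(xs))])
    V = (1 - gamma) * l + gamma * np.minimum(l, best_next)

# States with V > 0 are those from which safety can be maintained.
assert V[20] > 0          # x = 0 is safe
assert V[0] < 0           # x = -2 is already in the failure set
```

Replacing the table with a neural network over image observations is what "learned directly on vision context" refers to.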

Safe Reinforcement Learning Using Advantage-Based Intervention

This work proposes a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training and optimizes the agent’s policy using off-the-shelf RL algorithms designed for unconstrained MDPs.
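The intervention rule can be sketched in a few lines: let the learner act only when the safety advantage of its proposed action is acceptable, otherwise hand control to a backup policy. The function names, threshold `eta`, and toy check below are illustrative assumptions, not SAILR's exact interface.

```python
def shielded_action(s, pi_action, Q_safe, V_safe, backup, eta=0.0):
    """Advantage-based intervention: accept the learner's action only if its
    safety advantage Q_safe(s, a) - V_safe(s) is not worse than -eta;
    otherwise defer to a known-safe backup policy."""
    a = pi_action(s)
    if Q_safe(s, a) - V_safe(s) < -eta:
        return backup(s), True    # intervened
    return a, False

# Toy check: the learner proposes a = 1 in a state where only a = 0 is safe.
Q = lambda s, a: 0.0 if a == 0 else -1.0
V = lambda s: 0.0
a, intervened = shielded_action(0, lambda s: 1, Q, V, backup=lambda s: 0, eta=0.5)
assert (a, intervened) == (0, True)
```

Because the shield only edits actions, the inner loop can use any off-the-shelf unconstrained RL algorithm, as the summary notes.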

Constrained Update Projection Approach to Safe Policy Optimization

This study proposes CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys a rigorous safety guarantee, and shows the effectiveness of CUP both in terms of reward and safety constraint satisfaction.



Benchmarking Safe Exploration in Deep Reinforcement Learning

This work proposes to standardize constrained RL as the main formalism for safe exploration, and presents the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL.

Safe Exploration in Continuous Action Spaces

This work addresses the problem of deploying a reinforcement learning agent on a physical system such as a datacenter cooling unit or robot, where critical constraints must never be violated, and directly adds to the policy a safety layer that analytically solves an action-correction formulation for each state.
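For a single linearised constraint g·a + c ≤ 0, the analytic correction is a closed-form projection of the proposed action onto the safe half-space, sketched below; the constraint values in the example are illustrative.

```python
import numpy as np

def correct_action(a, g, c):
    """Project action a onto the half-space {a : g.a + c <= 0}, the
    linearised safety constraint; closed form for a single constraint."""
    lam = max(0.0, (g @ a + c) / (g @ g))
    return a - lam * g

g = np.array([1.0, 0.0])              # constraint: a_x <= 0.5
a_safe = correct_action(np.array([2.0, 1.0]), g, c=-0.5)
assert np.allclose(a_safe, [0.5, 1.0])   # only the violating component moves
assert g @ a_safe - 0.5 <= 1e-9
```

The correction is the minimal-norm change to the action, so the policy's intent is preserved whenever it is already safe (lam = 0 leaves the action untouched).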

Safety Augmented Value Estimation From Demonstrations (SAVED): Safe Deep Model-Based RL for Sparse Cost Robotic Tasks

A new model-based reinforcement learning algorithm, SAVED, which uses supervision that only identifies task completion and a modest set of suboptimal demonstrations to constrain exploration and learn efficiently while handling complex constraints, making it feasible to safely learn a control policy directly on a real robot in less than an hour.

Robust Regression for Safe Exploration in Control

A deep robust regression model is presented that is trained to directly predict the uncertainty bounds for safe exploration and can outperform the conventional Gaussian process (GP) based safe exploration in settings where it is difficult to specify a good GP prior.

Safe Policies for Reinforcement Learning via Primal-Dual Methods

It is established that primal-dual algorithms are able to find policies that are safe and optimal, and an ergodic relaxation of the safe-learning problem is proposed.
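The primal-dual scheme alternates a policy (primal) step on the Lagrangian with a multiplier (dual) step on the constraint violation. The toy objective below (J_r = -(θ-2)², J_c = θ, budget d = 1) is an illustrative scalar stand-in for the policy-gradient setting.

```python
def primal_dual_step(theta, lam, d=1.0, lr=0.05):
    """One primal-dual update for: max_theta J_r(theta) s.t. J_c(theta) <= d,
    with toy J_r = -(theta - 2)^2 and J_c = theta.
    Ascend the Lagrangian L = J_r - lam * (J_c - d) in theta; ascend the
    violation in lam, projecting onto lam >= 0."""
    theta = theta + lr * (-2.0 * (theta - 2.0) - lam)   # dL/dtheta
    lam = max(0.0, lam + lr * (theta - d))              # constraint violation
    return theta, lam

theta, lam = 0.0, 0.0
for _ in range(2000):
    theta, lam = primal_dual_step(theta, lam)

# The unconstrained optimum theta = 2 violates J_c <= 1; the iterates
# converge to the safe optimum theta = 1 instead.
assert abs(theta - 1.0) < 0.05
```

The ergodic relaxation in the paper concerns averaging such iterates so that constraints hold in the long run even though individual updates may transiently violate them.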

Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

This work proposes an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and reset policy, with the reset policy resetting the environment for a subsequent attempt.

Constrained Policy Optimization

Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees of near-constraint satisfaction at each iteration, which allows training neural network policies for high-dimensional control while making guarantees about policy behavior throughout training.

Safe Reinforcement Learning through Meta-learned Instincts

The results suggest that meta-learning augmented with an instinctual network is a promising new approach for safe AI, which may enable progress in this area on a variety of different domains.

Learning-Based Model Predictive Control for Safe Exploration

This paper presents a learning-based model predictive control scheme that can provide provable high-probability safety guarantees and exploits regularity assumptions on the dynamics in terms of a Gaussian process prior to construct provably accurate confidence intervals on predicted trajectories.

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
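The conservatism comes from a regulariser added to the usual Bellman error; a minimal sketch of that term (Bellman part omitted, discrete actions assumed) is below.

```python
import numpy as np

def cql_regularizer(q_values, data_action, alpha=1.0):
    """CQL regulariser: alpha * ( logsumexp_a Q(s, a) - Q(s, a_data) ).
    Minimising it pushes Q down on all (especially out-of-distribution)
    actions and back up on the dataset action, so the learned Q-function
    soft-lower-bounds the data policy's true value."""
    m = q_values.max()
    logsumexp = m + np.log(np.exp(q_values - m).sum())
    return alpha * (logsumexp - q_values[data_action])

# With a uniform Q over 4 actions, the penalty reduces to log(4):
assert np.isclose(cql_regularizer(np.zeros(4), data_action=0), np.log(4))
```

This is the same conservatism that "Conservative Safety Critics for Exploration" at the top of this page reuses for safety: applied to a failure-probability critic, the lower bound becomes an overestimate of risk.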