Safer Reinforcement Learning through Transferable Instinct Networks

@inproceedings{Grbic2021SaferRL,
  title={Safer Reinforcement Learning through Transferable Instinct Networks},
  author={Djordje Grbic and Sebastian Risi},
  booktitle={ALIFE},
  year={2021}
}
Random exploration is one of the main mechanisms through which reinforcement learning (RL) finds well-performing policies. However, it can lead to undesirable or catastrophic outcomes when learning online in safety-critical environments. In fact, safe learning is one of the major obstacles to real-world agents that can learn during deployment. One way of ensuring that agents respect hard limitations is to explicitly configure boundaries in which they can operate. While this might work in… 
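
As a rough illustration of the kind of hard, explicitly configured boundary the abstract mentions (and not the paper's instinct-network method), the following Python sketch wraps a policy so its actions can never move a toy agent outside a fixed safe box; the additive dynamics and the box limits are assumptions made only for this example.

import numpy as np

class BoundaryWrapper:
    """Clips a continuous action so the next (predicted) position stays
    inside a configured safe box. Names and dynamics are illustrative."""
    def __init__(self, policy, low, high):
        self.policy = policy
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)

    def act(self, position):
        action = self.policy(position)      # proposed displacement
        predicted = position + action       # simple additive dynamics (assumed)
        clipped = np.clip(predicted, self.low, self.high)
        return clipped - position           # corrected, boundary-respecting action

# Usage: a purely random exploration policy kept inside the box [-1, 1]^2.
rng = np.random.default_rng(0)
wrapper = BoundaryWrapper(lambda s: rng.normal(scale=0.5, size=2), low=[-1, -1], high=[1, 1])
pos = np.zeros(2)
for _ in range(100):
    pos = pos + wrapper.act(pos)
assert np.all(pos >= -1) and np.all(pos <= 1)
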
1 Citation

Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation
TLDR
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The proposed algorithm, Triple-Q, is similar to SARSA for unconstrained MDPs and is computationally efficient.
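
The sketch below is not Triple-Q; it is only a generic, hedged illustration of the constrained (CMDP) tabular setting the paper targets, using a simple primal-dual Q-learning update with a toy cost budget. All dynamics, rewards, and hyperparameters are made up for the example.

import numpy as np

# Hedged sketch: tabular primal-dual constrained Q-learning on a toy CMDP.
n_states, n_actions = 5, 2
Qr = np.zeros((n_states, n_actions))   # reward action-values
Qc = np.zeros((n_states, n_actions))   # cost (constraint) action-values
lam, alpha, gamma, budget = 0.0, 0.1, 0.9, 0.2

def step(s, a, rng):
    # toy dynamics: action 1 earns more reward but sometimes incurs cost (assumed)
    r = 1.0 if a == 1 else 0.5
    c = 1.0 if (a == 1 and rng.random() < 0.5) else 0.0
    return rng.integers(n_states), r, c

rng = np.random.default_rng(0)
s = 0
for t in range(5000):
    # act greedily w.r.t. the Lagrangian value, with epsilon-exploration
    a = rng.integers(n_actions) if rng.random() < 0.1 else np.argmax(Qr[s] - lam * Qc[s])
    s2, r, c = step(s, a, rng)
    a2 = np.argmax(Qr[s2] - lam * Qc[s2])
    Qr[s, a] += alpha * (r + gamma * Qr[s2, a2] - Qr[s, a])
    Qc[s, a] += alpha * (c + gamma * Qc[s2, a2] - Qc[s, a])
    lam = max(0.0, lam + 0.01 * (c - budget))   # dual ascent on the constraint multiplier
    s = s2
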

References

SHOWING 1-10 OF 40 REFERENCES
Safe Reinforcement Learning through Meta-learned Instincts
TLDR
The results suggest that meta-learning augmented with an instinctual network is a promising new approach for safe AI, which may enable progress in this area across a variety of domains.
Benchmarking Safe Exploration in Deep Reinforcement Learning
TLDR
This work proposes to standardize constrained RL as the main formalism for safe exploration, and presents the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL.
Generalizing from a few environments in safety-critical reinforcement learning
TLDR
It is shown that catastrophes can be significantly reduced with simple modifications, including ensemble model averaging and the use of a blocking classifier, and that the uncertainty information from the ensemble is useful for predicting whether a catastrophe will occur within a few steps and hence whether human intervention should be requested.
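
A hedged sketch of the blocking-classifier idea described above: an ensemble scores the catastrophe risk of a proposed action, and a high average risk or strong ensemble disagreement triggers a block (for example, a request for human intervention). The thresholds and toy classifiers are placeholders, not the paper's configuration.

import numpy as np

def blocked(state, action, ensemble, risk_threshold=0.5, disagreement_threshold=0.3):
    # Each ensemble member returns an estimated catastrophe probability.
    risks = np.array([model(state, action) for model in ensemble])
    return risks.mean() > risk_threshold or risks.std() > disagreement_threshold

# Toy classifiers that flag large actions as risky (illustrative only).
ensemble = [lambda s, a, b=b: float(a > b) for b in (0.6, 0.7, 0.8)]
print(blocked(None, 0.9, ensemble))   # all members flag risk -> True
print(blocked(None, 0.5, ensemble))   # no member flags risk -> False
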
Combating Deep Reinforcement Learning's Sisyphean Curse with Reinforcement Learning
TLDR
Intrinsic fear, a new method that mitigates these problems by avoiding dangerous states, is introduced, together with a learned reward shaping that accelerates learning and guards oscillating policies against repeated catastrophes.
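
A hedged sketch of intrinsic-fear-style reward shaping: a learned danger model estimates how close a state is to catastrophe, and that estimate, scaled by a coefficient, is subtracted from the reward. The danger model and coefficient below are illustrative stand-ins, not the paper's exact formulation.

import numpy as np

def shaped_reward(reward, state, danger_model, fear_coeff=1.0):
    p_danger = danger_model(state)          # estimated proximity to catastrophe, in [0, 1]
    return reward - fear_coeff * p_danger

# Toy usage: danger grows as the agent nears the left wall at x = 0 (assumed).
danger_model = lambda s: float(np.clip(1.0 - s[0], 0.0, 1.0))
print(shaped_reward(1.0, np.array([0.2, 0.0]), danger_model))   # heavily penalized
print(shaped_reward(1.0, np.array([0.9, 0.0]), danger_model))   # barely penalized
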
Learning to be Safe: Deep RL with a Safety Critic
TLDR
This work proposes to learn how to be safe in one set of tasks and environments, and then to use that learned intuition to constrain future behaviors when learning new, modified tasks. This form of safety-constrained transfer learning is studied empirically in three challenging domains.
Conservative Safety Critics for Exploration
TLDR
This paper theoretically characterizes the trade-off between safety and policy improvement, shows that the safety constraints are likely to be satisfied with high probability during training, derives provable convergence guarantees for the approach, and demonstrates its efficacy on a suite of challenging navigation, manipulation, and locomotion tasks.
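
A hedged sketch of critic-gated exploration in the spirit of this work: candidate actions are sampled from the exploration policy and executed only if a safety critic's estimated failure probability stays below a threshold. The critic here is a toy stand-in; the paper trains a conservative critic and provides the guarantees summarized above.

import numpy as np

def safe_action(state, sample_action, safety_critic, eps=0.1, n_tries=20, fallback=None):
    for _ in range(n_tries):
        a = sample_action(state)
        if safety_critic(state, a) <= eps:   # estimated failure probability low enough
            return a
    return fallback                          # no safe candidate found, use fallback

rng = np.random.default_rng(0)
critic = lambda s, a: float(abs(a) > 0.8)    # toy critic: large actions deemed unsafe
a = safe_action(np.zeros(2), lambda s: rng.uniform(-1, 1), critic, fallback=0.0)
print(a)
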
Safety-Guided Deep Reinforcement Learning via Online Gaussian Process Estimation
TLDR
A novel approach is proposed for incorporating safety estimates to guide exploration and policy search in deep reinforcement learning: a cost function captures trajectory-based safety, the state-action value function of this safety cost is formulated as a candidate Lyapunov function, and control-theoretic results are extended to approximate its derivative using online Gaussian process estimation.
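
A hedged sketch of the online Gaussian-process estimation idea only: a GP is fit to the observed safety costs of visited states, and exploration prefers candidates with low predicted cost plus uncertainty. This does not reproduce the Lyapunov-based analysis; scikit-learn's GaussianProcessRegressor and the toy data are assumptions of the example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

gp = GaussianProcessRegressor()
X, y = [], []                               # visited states and observed safety costs

def record(state, cost):
    # Refit the GP each time a new safety observation arrives (online estimation).
    X.append(state); y.append(cost)
    gp.fit(np.array(X), np.array(y))

def pick_safest(candidate_states):
    mean, std = gp.predict(np.array(candidate_states), return_std=True)
    return int(np.argmin(mean + std))       # pessimistic (upper-bound) choice

record([0.0, 0.0], 0.0)
record([1.0, 1.0], 1.0)                     # far corner observed to be costly (toy data)
print(pick_safest([[0.1, 0.1], [0.9, 0.9]]))   # -> 0, the candidate near the safe region
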
Safe Exploration Techniques for Reinforcement Learning - An Overview
TLDR
This work gives an overview of different approaches to safety in (semi-)autonomous robotics and addresses the issue of how to define safety in real-world applications (absolute safety being apparently unachievable in the continuous and random real world).
Safe Reinforcement Learning via Shielding
TLDR
This work proposes a new approach to learn optimal policies while enforcing properties expressed in temporal logic by synthesizing a reactive system called a shield that monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification.
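
A hedged sketch of the shield interface: each proposed action is checked against a safety specification and replaced with a permitted action when it would violate the specification. Real shields are synthesized from temporal-logic specifications; the hand-written rule below is only a stand-in.

def shield(state, proposed_action, is_safe, safe_actions):
    if is_safe(state, proposed_action):
        return proposed_action               # proposal satisfies the specification
    for a in safe_actions(state):            # otherwise fall back to a permitted action
        if is_safe(state, a):
            return a
    raise RuntimeError("specification unsatisfiable from this state")

# Toy grid example: never step off the right edge at x = 4 (assumed specification).
is_safe = lambda s, a: not (s[0] == 4 and a == "right")
safe_actions = lambda s: ["left", "up", "down", "right"]
print(shield((4, 2), "right", is_safe, safe_actions))   # -> "left"
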
A comprehensive survey on safe reinforcement learning
TLDR
This work categorizes and analyzes two approaches to safe reinforcement learning: modifying the optimality criterion (the classic discounted finite/infinite horizon) with a safety factor, and incorporating external knowledge or the guidance of a risk metric.