Verifiably safe exploration for end-to-end reinforcement learning

@inproceedings{Hunt2021VerifiablySE,
  title={Verifiably safe exploration for end-to-end reinforcement learning},
  author={Nathan Hunt and Nathan Fulton and Sara Magliacane and Nghia Hoang and Subhro Das and Armando Solar-Lezama},
  booktitle={Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control},
  year={2021}
}
Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of… 
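
From the abstract, the approach combines an object detector with automated reasoning over a hybrid-systems model. One natural way to picture this is a runtime monitor that maps pixels to a symbolic state, checks the learned action against a verified model, and overrides it when safety cannot be certified. A minimal sketch of that pattern, in which the detector, the monitor, and the fallback controller are illustrative placeholders rather than the paper's actual components:

def shielded_step(env, policy, detector, safety_monitor, safe_fallback):
    """Hedged sketch (not the authors' code): one environment step with a
    symbolic safety monitor guarding an end-to-end visual policy. All of
    the callables passed in are illustrative assumptions."""
    obs = env.observe()              # raw image observation (hypothetical API)
    action = policy(obs)             # action from the end-to-end learned policy
    symbolic_state = detector(obs)   # e.g. object classes and positions
    # Check the proposed action against a verified model of the dynamics on
    # the symbolic state; fall back to a provably safe controller if it fails.
    if not safety_monitor(symbolic_state, action):
        action = safe_fallback(symbolic_state)
    return env.step(action)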

Citations

Verified Probabilistic Policies for Deep Reinforcement Learning
TLDR
This paper proposes an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy's execution, and presents techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement, and probabilistic model checking.
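
Concretely, an interval MDP abstraction replaces exact transition probabilities with intervals, and model checking then yields two-sided bounds on the probability that the policy's execution reaches an unsafe state. A sketch of the standard interval value iteration behind such bounds (the paper's exact formulation may differ):

\[
\underline{P}(s) \;\le\; \Pr_{\pi}\bigl[\text{reach unsafe states from } s\bigr] \;\le\; \overline{P}(s),
\qquad
\overline{P}_{k+1}(s) \;=\; \max_{p(\cdot \mid s) \in [\underline{p},\, \overline{p}]} \;\sum_{s'} p(s' \mid s)\, \overline{P}_k(s'),
\]

with the lower bound obtained by minimising over the same interval of transition distributions.
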
A Review of Safe Reinforcement Learning: Methods, Theory and Applications
TLDR
A review of the progress of safe RL from the perspectives of methods, theory, and applications, covering problems that are crucial for deploying safe RL in real-world applications, coined as “2H3W”.
Exploration in Deep Reinforcement Learning: From Single-Agent to Multi-Agent Domain
TLDR
A comprehensive survey of existing exploration methods for both single-agent and multi-agent RL, identifying several key challenges to efficient exploration and pointing out a few future directions.
Provably Safe Deep Reinforcement Learning for Robotic Manipulation in Human Environments
TLDR
This work utilizes a fast reachability analysis of humans and manipulators to guarantee that the manipulator comes to a complete stop before a human is within its range and significantly improves the RL performance by preventing episode-ending collisions.
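
The core of such a guarantee is a kinematic argument: the manipulator must be able to brake to a complete stop before a worst-case human can close the remaining gap. A toy one-dimensional version of that check, with all quantities (speeds, deceleration, reaction time) as illustrative assumptions standing in for the paper's full reachability analysis:

def stop_is_safe(gap, robot_speed, robot_decel, human_speed, reaction_time=0.1):
    """Toy 1-D check: the robot's stopping distance plus worst-case human
    motion during the stopping time must stay below the current gap.
    All parameters are illustrative placeholders."""
    t_stop = reaction_time + robot_speed / robot_decel      # time to full stop
    d_robot = robot_speed * reaction_time + robot_speed**2 / (2 * robot_decel)
    d_human = human_speed * t_stop                           # worst-case human advance
    return d_robot + d_human < gap
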
Provably Safe Reinforcement Learning: A Theoretical and Experimental Comparison
TLDR
A categorization of existing provably safe RL methods is introduced, and the theoretical foundations for both continuous and discrete action spaces are presented, showing that indeed only provably safe RL methods guarantee safety.
Dynamic Shielding for Reinforcement Learning in Black-Box Environments
TLDR
The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm, and suppresses undesired exploration with a shield constructed from the learned model, so that potentially unsafe actions can be foreseen before the agent experiences them.
CertRL: formalizing convergence proofs for value and policy iteration in Coq
TLDR
A Coq formalization of two canonical reinforcement learning algorithms, value and policy iteration for finite-state Markov decision processes, together with a contraction property of the Bellman optimality operator that establishes convergence in the infinite-horizon limit.
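
The convergence argument being formalized is the classical contraction one: the Bellman optimality operator is a $\gamma$-contraction in the sup norm, so value iteration converges to a unique fixed point by the Banach fixed-point theorem. In standard notation (the Coq development's exact statement may differ):

\[
(TV)(s) \;=\; \max_{a}\Bigl[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')\Bigr],
\qquad
\|TV_1 - TV_2\|_\infty \;\le\; \gamma\, \|V_1 - V_2\|_\infty ,
\]

so the iterates $V_{k+1} = TV_k$ converge to the unique fixed point $V^*$ from any initial $V_0$.
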
Ablation Study of How Run Time Assurance Impacts the Training and Performance of Reinforcement Learning Agents
TLDR
By studying multiple RTA approaches in both on-policy and off-policy RL algorithms, this work seeks to understand which RTA methods are most effective, whether the agents become dependent on the RTA, and the importance of reward shaping versus safe exploration in RL agent training.
Parrot: Data-Driven Behavioral Priors for Reinforcement Learning
TLDR
This paper proposes a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials from a wide range of previously seen tasks, and shows how this learned prior can be used for rapidly learning new tasks without impeding the RL agent's ability to try out novel behaviors.
MLNav: Learning to Safely Navigate on Martian Terrains
TLDR
Compared to the baseline ENav path planner on board the Perseverance rover, MLNav can provide a significant improvement in multiple key metrics, such as a 10x reduction in collision checks when navigating real Martian terrains, despite being trained with synthetic terrains.
...
...

References

SHOWING 1-10 OF 59 REFERENCES
Benchmarking Safe Exploration in Deep Reinforcement Learning
TLDR
This work proposes to standardize constrained RL as the main formalism for safe exploration, and presents the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL.
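
The constrained-RL formalism that Safety Gym standardizes is the constrained MDP: maximize expected return subject to bounds on expected cumulative cost. In the usual notation:

\[
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\Bigl[\sum_{t} \gamma^{t} r(s_t, a_t)\Bigr]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Bigl[\sum_{t} \gamma^{t} c_i(s_t, a_t)\Bigr] \;\le\; d_i, \qquad i = 1, \dots, k,
\]

where the $c_i$ are cost signals emitted by the environment and the $d_i$ are fixed thresholds.
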
Verifiably Safe Off-Model Reinforcement Learning
TLDR
This paper introduces verification-preserving model updates, the first approach toward obtaining formal safety guarantees for reinforcement learning in settings where multiple environmental models must be taken into account, through a combination of design-time model updates and runtime model falsification.
Safe Exploration in Continuous Action Spaces
TLDR
This work addresses the problem of deploying a reinforcement learning agent on a physical system such as a datacenter cooling unit or robot, where critical constraints must never be violated, and directly adds to the policy a safety layer that analytically solves an action correction formulation at each state.
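
The analytic correction is typically a projection of the learned action onto a linearized constraint model: with a learned approximation $\bar c(s) + g(s)^{\top} a$ of the constrained quantity, the closest admissible action has a closed form. A sketch of that standard single-constraint form, which may differ in detail from the paper's construction:

\[
a^{*} \;=\; \arg\min_{a}\; \tfrac{1}{2}\,\|a - \pi_\theta(s)\|^{2}
\quad \text{s.t.} \quad \bar c(s) + g(s)^{\top} a \;\le\; C,
\qquad
a^{*} \;=\; \pi_\theta(s) - \lambda^{*} g(s), \quad
\lambda^{*} \;=\; \Bigl[\frac{\bar c(s) + g(s)^{\top} \pi_\theta(s) - C}{g(s)^{\top} g(s)}\Bigr]_{+} .
\]
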
Safe Reinforcement Learning via Formal Methods: Toward Safe Control Through Proof and Learning
TLDR
It is proved that the approach toward incorporating knowledge about safe control into learning systems preserves safety guarantees, and it is demonstrated that the empirical performance benefits provided by reinforcement learning are retained.
Safe Reinforcement Learning via Shielding
TLDR
This work proposes a new approach to learn optimal policies while enforcing properties expressed in temporal logic by synthesizing a reactive system called a shield that monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification.
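
Operationally, a shield sits between the learner and the environment: actions that keep the temporal-logic specification satisfiable pass through unchanged, and anything else is replaced by a safe alternative. A minimal interface sketch; the paper synthesizes the shield as a reactive system from the specification, whereas the callables here are illustrative placeholders:

class Shield:
    """Minimal illustrative shield wrapper (not the paper's synthesis)."""

    def __init__(self, admissible, safe_action):
        self.admissible = admissible    # (state, action) -> bool
        self.safe_action = safe_action  # state -> specification-preserving action

    def filter(self, state, proposed_action):
        # Pass the learner's action through when it is admissible,
        # otherwise override it so the specification cannot be violated.
        if self.admissible(state, proposed_action):
            return proposed_action
        return self.safe_action(state)
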
End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks
TLDR
This work proposes a controller architecture that combines a model-free RL-based controller with model-based controllers utilizing control barrier functions (CBFs) and on-line learning of the unknown system dynamics, in order to ensure safety during learning.
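
A CBF-based filter of this kind can be summarized as a quadratic program that minimally perturbs the RL action while enforcing the barrier condition, which keeps the safe set forward invariant. In standard control-affine notation (the choice of barrier function and class-$\mathcal{K}$ function is system-specific):

\[
u^{*}(x) \;=\; \arg\min_{u}\; \|u - u_{\mathrm{RL}}(x)\|^{2}
\quad \text{s.t.} \quad
\nabla h(x)^{\top}\bigl(f(x) + g(x)\, u\bigr) \;\ge\; -\,\alpha\bigl(h(x)\bigr),
\]

where the dynamics are $\dot{x} = f(x) + g(x)\,u$ and $h$ is a control barrier function whose superlevel set $\{x : h(x) \ge 0\}$ is the safe set.
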
Learning-Based Model Predictive Control for Safe Exploration
TLDR
This paper presents a learning-based model predictive control scheme that can provide provable high-probability safety guarantees and exploits regularity assumptions on the dynamics in terms of a Gaussian process prior to construct provably accurate confidence intervals on predicted trajectories.
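
The high-probability guarantee rests on GP confidence intervals around the unknown dynamics, which are propagated along predicted trajectories so that the MPC only executes plans whose confidence tubes stay inside the safe set. A sketch of the standard bound (the constants and assumptions in the paper differ in detail):

\[
\bigl|\, f(x) - \mu_{n}(x) \,\bigr| \;\le\; \beta_{n}\, \sigma_{n}(x)
\quad \text{for all } x, \text{ with probability at least } 1 - \delta,
\]

where $\mu_{n}$ and $\sigma_{n}$ are the GP posterior mean and standard deviation after $n$ observations and $\beta_{n}$ is a confidence multiplier depending on $\delta$ and on regularity assumptions about the dynamics.
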
Logically-Constrained Reinforcement Learning
TLDR
This paper presents the first model-free reinforcement learning (RL) algorithm to synthesise policies for an unknown Markov decision process (MDP) such that a linear-time property is satisfied, and proves that it is guaranteed to find a policy whose traces probabilistically satisfy the LTL property if such a policy exists.
Constrained Policy Optimization
TLDR
Constrained Policy Optimization (CPO) is proposed, the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees of near-constraint satisfaction at each iteration, which allows training neural network policies for high-dimensional control while making guarantees about policy behavior throughout training.
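
Each CPO iteration solves a trust-region problem: maximize a surrogate for the return while keeping a surrogate for the constraint cost below its threshold and the new policy close to the old one. Roughly, suppressing the exact surrogate bounds used in the paper:

\[
\pi_{k+1} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\bigl[A^{\pi_k}_{r}(s,a)\bigr]
\quad \text{s.t.} \quad
J_{C}(\pi_k) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\bigl[A^{\pi_k}_{C}(s,a)\bigr] \;\le\; d,
\qquad
\bar{D}_{\mathrm{KL}}(\pi \,\|\, \pi_k) \;\le\; \delta .
\]
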
Safe Model-based Reinforcement Learning with Stability Guarantees
TLDR
This paper presents a learning algorithm that explicitly considers safety, defined in terms of stability guarantees, and extends control-theoretic results on Lyapunov stability verification and shows how to use statistical models of the dynamics to obtain high-performance control policies with provable stability certificates.
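
The stability certificate is a Lyapunov decrease condition verified, under the statistical model's confidence bounds, on a level set of a Lyapunov function; when it holds, that level set is forward invariant and exploration within it is safe. Schematically, for discrete-time closed-loop dynamics $f_{\pi}(x) = f(x, \pi(x))$:

\[
v\bigl(f_{\pi}(x)\bigr) \;<\; v(x) \quad \text{for all } x \in \{x : v(x) \le c\} \setminus \{0\}
\;\;\Longrightarrow\;\;
\{x : v(x) \le c\} \text{ is forward invariant and the closed loop is asymptotically stable on it.}
\]
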
...
...