Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)

Florent Delgrange, Ann Nowé, Guillermo A. Pérez
AAAI Conference on Artificial Intelligence
We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex… 


The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

The Wasserstein Belief Updater (WBU) is proposed: an RL algorithm that learns a latent model of the POMDP together with an approximation of the belief update, ensuring that the resulting beliefs allow learning the optimal value function.
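For context, the exact belief update that WBU learns to approximate is the standard discrete Bayes filter. The sketch below shows that update for a tabular POMDP; the array shapes and the toy two-state model are illustrative assumptions, not from the paper.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One exact Bayes-filter step for a discrete POMDP.

    belief: (S,) current distribution over states.
    T: (A, S, S) transition probabilities T[a, s, s'].
    O: (A, S, Z) observation probabilities O[a, s', z].
    """
    predicted = belief @ T[action]                    # predict next-state distribution
    weighted = predicted * O[action][:, observation]  # weight by observation likelihood
    return weighted / weighted.sum()                  # renormalize

# Toy model: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b1 = belief_update(b, action=0, observation=0, T=T, O=O)
```

Observing symbol 0, which is more likely in state 0, shifts the belief toward state 0.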

Model-based Offline Reinforcement Learning with Local Misspecification

A model-based reinforcement-learning performance lower bound that explicitly captures dynamics-model misspecification and distribution mismatch is presented; an empirical algorithm for optimal offline policy selection is proposed, and a novel safe policy improvement theorem is proved.

MSVIPER: Improved Policy Distillation for Reinforcement-Learning-Based Robot Navigation

MSVIPER learns an “expert” policy using any reinforcement-learning technique that yields a state-action mapping, then uses imitation learning to distill it into a decision-tree policy; the resulting trees are efficient and accurately mimic the expert's behavior.
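The distillation step in this family of methods (VIPER and its variants) is essentially behavioral cloning of the expert onto a small tree. Below is a minimal sketch with a hand-rolled depth-1 tree and a hypothetical stand-in expert; everything here is illustrative, not MSVIPER's actual algorithm.

```python
import numpy as np

def expert_policy(state):
    # Hypothetical stand-in for an RL expert: brake (action 1) when velocity < 0.
    return int(state[1] < 0.0)

def fit_stump(X, y):
    """Fit a depth-1 decision tree (one feature, one threshold) by
    minimizing training misclassification error."""
    best = (1.0, 0, 0.0, False)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] < t).astype(int)
            for flip in (False, True):
                labels = 1 - pred if flip else pred
                err = float(np.mean(labels != y))
                if err < best[0]:
                    best = (err, f, float(t), flip)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))   # states visited under the expert
y = np.array([expert_policy(s) for s in X]) # expert's action labels
err, feature, threshold, flip = fit_stump(X, y)
```

Because the stand-in expert thresholds on a single feature, the stump recovers it exactly; real distillation uses deeper trees and iterative data collection (DAgger-style) to handle richer policies.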

Verified Probabilistic Policies for Deep Reinforcement Learning

This paper proposes an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy’s execution, and presents techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement, and probabilistic model checking.
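The core primitive behind such interval-MDP abstractions is an optimistic or pessimistic Bellman backup in which “nature” resolves each transition interval adversarially. Below is a toy sketch of the pessimistic backup for a reachability objective; the three-state example chain is an assumption for illustration, not from the paper.

```python
import numpy as np

def pessimistic_backup(v, lows, highs):
    """Worst-case expected value over a transition interval [lows, highs].

    Nature keeps every interval lower bound and assigns the remaining
    probability mass to successors with the smallest current value first.
    """
    p = lows.copy()
    slack = 1.0 - lows.sum()
    for s in np.argsort(v):               # cheapest successors first
        give = min(slack, highs[s] - lows[s])
        p[s] += give
        slack -= give
    return float(p @ v)

# States: 0 = start, 1 = goal (value 1), 2 = sink (value 0).
lows = np.array([0.1, 0.3, 0.2])          # interval lower bounds from start
highs = np.array([0.3, 0.6, 0.5])         # interval upper bounds from start
v = np.array([0.0, 1.0, 0.0])
for _ in range(100):                      # interval value iteration to a fixpoint
    v[0] = pessimistic_backup(v, lows, highs)
```

At the fixpoint the start state's value is a sound lower bound on the reachability probability, valid for every concrete MDP whose transitions lie inside the intervals.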



Markov Decision Processes: Discrete Stochastic Dynamic Programming

  • M. Puterman
  • Wiley Series in Probability and Statistics
  • 1994
Markov Decision Processes covers recent research advances in areas such as countable-state-space models with the average-reward criterion, constrained models, and models with risk-sensitive optimality criteria, and explores several topics that have received little or no attention in other books.
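The book's central algorithm for the discounted criterion, value iteration, fits in a few lines. A minimal sketch for a tabular MDP follows; the tiny two-state example is illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP.

    P: (A, S, S) transition probabilities, R: (A, S) expected rewards.
    Returns the optimal value function and a greedy policy.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * (P @ v)       # (A, S) action values
        v_new = q.max(axis=0)
        if np.abs(v_new - v).max() < tol:
            return v_new, q.argmax(axis=0)
        v = v_new

# Two states, two actions: action 1 moves state 0 to the rewarding state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
v, policy = value_iteration(P, R)
```

With gamma = 0.9 the rewarding state is worth 1/(1-0.9) = 10, the start state 0.9 x 10 = 9, and the greedy policy takes action 1 at the start state.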

Human-level control through deep reinforcement learning

  • 2015

AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training

This work investigates settings where a concise abstract model of the safety aspects is given, proposes an RL algorithm that uses this abstract model to learn policies for CMDPs safely, i.e. without violating the constraints during training, and proves that the algorithm is safe under the given assumptions.

Reverb: A Framework For Experience Replay

This paper introduces Reverb, an efficient, extensible, and easy-to-use system designed specifically for experience replay in RL, and presents its core design together with empirical results on its performance characteristics.
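The data structure Reverb generalizes is the classic uniform replay buffer. A minimal sketch of that generic idea follows; this is not Reverb's API, which additionally supports prioritized sampling, rate limiting, and server-client operation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay with a fixed capacity."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the current contents.
        return random.sample(list(self.storage), batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, "obs", "act", 0.0))            # (step, observation, action, reward)
batch = buf.sample(2)
```

After five inserts only the three most recent transitions remain, and sampling draws uniformly from those.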

Steady State Analysis of Episodic Reinforcement Learning

This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's…

On Correctness, Precision, and Performance in Quantitative Verification - QComp 2020 Competition Report

This paper surveys the precision guarantees offered by the nine participating tools, ranging from exact rational results to statistical confidence statements, and reports on the experimental evaluation of these trade-offs performed in QComp 2020, the second friendly competition of tools for the analysis of quantitative formal models.

Global PAC Bounds for Learning Discrete Time Markov Chains

This work provides global bounds on the error made by such a learning process, expressed in terms of global behaviors formalized in temporal logic; it shows that such bounds cannot exist for full LTL and provides a bound that is uniform over all formulas of CTL.
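At the level of a single transition probability, such PAC statements reduce to a Hoeffding-style sample bound. A minimal sketch follows; the chain and parameters are illustrative, and this is only the per-transition ingredient, not the paper's global construction over temporal-logic formulas.

```python
import math
import random

def hoeffding_samples(eps, delta):
    """Samples needed so an empirical frequency lies within eps of the true
    probability with confidence at least 1 - delta (two-sided Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Estimate one transition probability of a two-state chain from sampled steps.
random.seed(0)
p_true = 0.3                                  # assumed P(s0 -> s1), for illustration
n = hoeffding_samples(eps=0.05, delta=0.01)   # 1060 samples for (0.05, 0.01)
p_hat = sum(random.random() < p_true for _ in range(n)) / n
```

The paper's contribution is to lift such local estimates to a bound on the error of *global* behaviors, uniform over all CTL formulas, which is much stronger than a per-transition guarantee.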

Formal Methods with a Touch of Magic

This work synthesizes a stand-alone correct-by-design controller that enjoys the favorable performance of RL; it incorporates a “magic book” into a bounded model checking (BMC) procedure, which makes it possible to find numerous traces of the plant under the control of the wizard.