Corpus ID: 245329592

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)

  title={Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)},
  author={Florent Delgrange and Ann Nowé and Guillermo A. Pérez},
We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep RL. To recover guarantees when applying advanced RL algorithms to more complex…

Figures and Tables from this paper

Verified Probabilistic Policies for Deep Reinforcement Learning
This paper proposes an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy’s execution, and presents techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement, and probabilistic model checking.
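The interval-MDP idea behind such probabilistic guarantees can be sketched briefly. In an interval MDP each transition carries a probability interval [lo, hi], and a pessimistic analysis picks, within those intervals, the distribution that minimizes the policy's value. The function name `worst_case_dist` and all numbers below are illustrative toys, not taken from the paper.

```python
def worst_case_dist(intervals, values):
    """Pick a probability in [lo, hi] for each successor, summing to 1,
    that puts as much mass as possible on low-value successors."""
    p = {s: lo for s, (lo, hi) in intervals.items()}  # start at lower bounds
    slack = 1.0 - sum(p.values())                     # mass left to assign
    for s in sorted(intervals, key=lambda s: values[s]):  # worst states first
        lo, hi = intervals[s]
        extra = min(hi - lo, slack)
        p[s] += extra
        slack -= extra
    return p

# Two successor states: "safe" (value 1.0) and "fail" (value 0.0).
intervals = {"safe": (0.6, 0.9), "fail": (0.1, 0.4)}
values = {"safe": 1.0, "fail": 0.0}
p = worst_case_dist(intervals, values)
print(p)  # all remaining mass is pushed toward "fail" within its interval
```

Iterating this worst-case step inside value iteration gives a lower bound on the value of any concrete MDP whose transition probabilities respect the intervals.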


AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training
This work investigates settings where a concise abstract model of the safety aspects is given and proposes an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is, without violating the constraints, and proves that this algorithm is safe under the given assumptions.
Verifiable RNN-Based Policies for POMDPs Under Temporal Logic Constraints
This work introduces an iterative modification of the so-called quantized bottleneck insertion technique to create a finite-state controller (FSC) as a randomized policy with memory; the approach outperforms traditional POMDP synthesis methods by three orders of magnitude while remaining within 2% of optimal benchmark values.
Safety-Constrained Reinforcement Learning for MDPs
This work models controller synthesis for stochastic and partially unknown environments in which safety is essential as a Markov decision process, where expected performance is measured using a cost function that is unknown prior to run-time exploration of the state space.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
Safe Reinforcement Learning via Shielding
A new approach to learn optimal policies while enforcing properties expressed in temporal logic by synthesizing a reactive system called a shield, which monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification.
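The monitor-and-correct loop described above can be sketched in a few lines. This is a toy illustration under assumed names (`shielded_step`, a precomputed `safe_actions` table); an actual shield is synthesized from the temporal-logic specification rather than written by hand.

```python
import random

def shielded_step(state, proposed_action, safe_actions):
    """Pass the learner's action through unless it violates the
    specification; otherwise substitute an allowed action."""
    allowed = safe_actions[state]          # actions the shield permits here
    if proposed_action in allowed:
        return proposed_action             # no interference
    return random.choice(sorted(allowed))  # correct only on violation

# Toy safe-action table: near the edge, only braking is safe.
safe = {"near_edge": {"brake"}, "open_road": {"brake", "accelerate"}}
print(shielded_step("near_edge", "accelerate", safe))  # prints "brake"
print(shielded_step("open_road", "accelerate", safe))  # prints "accelerate"
```

The key design point is that the shield is minimally interfering: the learner trains as usual, and corrections happen only at the boundary of the safe region.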
A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes
This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.
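The sparse sampling scheme summarized above can be sketched as follows: given only a generative model `sample(s, a) -> (next_state, reward)` (name assumed for illustration), estimate Q at the current state by recursively drawing C successors per action down to a fixed depth, so the per-decision cost is (|A| · C)^depth, independent of the number of states.

```python
def sparse_sample_q(sample, actions, s, depth, C=4, gamma=0.9):
    """Return {action: estimated Q-value} at state s."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(C):
            s2, r = sample(s, a)  # one call to the generative model
            q_next = sparse_sample_q(sample, actions, s2, depth - 1, C, gamma)
            total += r + gamma * max(q_next.values())
        q[a] = total / C
    return q

# Toy 4-state chain: "right" moves toward a rewarding terminal state.
# (Deterministic on purpose, so the estimate below is exact.)
def sample(s, a):
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

q = sparse_sample_q(sample, ["left", "right"], 0, depth=3)
print(max(q, key=q.get))  # prints "right"
```

With a stochastic generative model the same code yields a near-optimal estimate with high probability, at the price of a larger sampling width C.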
DeepMDP: Learning Continuous Latent Space Models for Representation Learning
This work introduces the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states, and shows that the optimization of these objectives guarantees the quality of the latent space as a representation of the state space.
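The two DeepMDP-style losses can be made concrete on a single transition: one loss fits a latent reward predictor, the other fits a latent transition model predicting the next latent state. Everything below (the linear encoder, the names `matvec`, `w_r`, `W_m`, the numbers) is an illustrative toy, not the paper's architecture; it only shows that both objectives reduce to tractable prediction errors.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_err(a, b):
    """Mean squared error between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

# Toy linear encoder, reward head, and latent transition model.
W_enc = [[0.5, -0.2], [0.1, 0.3]]   # state (2-d) -> latent (2-d)
w_r   = [1.0, 0.0]                  # latent -> predicted reward
W_m   = [[0.9, 0.0], [0.0, 0.9]]    # latent -> predicted next latent

s, s_next, r = [1.0, 2.0], [0.5, 1.5], 0.3  # one observed transition
z, z_next = matvec(W_enc, s), matvec(W_enc, s_next)

reward_loss = (sum(wi * zi for wi, zi in zip(w_r, z)) - r) ** 2
transition_loss = sq_err(matvec(W_m, z), z_next)
total = reward_loss + transition_loss  # combined training objective
print(total)
```

In training, `total` would be minimized over batches of transitions by gradient descent on the encoder and both heads; the cited guarantee is that driving these two losses down controls the quality of the latent representation.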
Equivalence notions and model minimization in Markov decision processes
Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation
This article introduces Variational State Tabulation (VaST), which maps an environment with a high-dimensional state space to an abstract tabular model, and shows how VaST can rapidly learn to maximize reward in tasks like 3D navigation and efficiently adapt to sudden changes in rewards or transition probabilities.
Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model
The stochastic latent actor-critic (SLAC) algorithm is proposed: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs.