Defining admissible rewards for high-confidence policy evaluation in batch reinforcement learning

Niranjani Prasad, Barbara E. Engelhardt, and Finale Doshi-Velez. Proceedings of the ACM Conference on Health, Inference, and Learning.
A key impediment to reinforcement learning (RL) in real applications with limited, batch data is in defining a reward function that reflects what we implicitly know about reasonable behaviour for a task and allows for robust off-policy evaluation. In this work, we develop a method to identify an admissible set of reward functions for policies that (a) do not deviate too far in performance from prior behaviour, and (b) can be evaluated with high confidence, given only a collection of past… 
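The admissibility criterion (a) above can be caricatured under a linear-reward assumption, R(s) = w · φ(s), so that a policy's expected return under weights w is w · μ for its discounted feature expectation μ. The sketch below is mine, not the paper's method: it checks only criterion (a) (behaviour close to best available), omits the high-confidence evaluability criterion (b), and all names, features, and the ε threshold are illustrative.

```python
import numpy as np

def admissible_rewards(ws, mu_behaviour, mu_candidates, eps):
    """Keep reward weight vectors under which the behaviour policy's return
    (w . mu_behaviour) is within eps of the best candidate policy's return.
    Criterion (a) only; the confidence criterion (b) is omitted."""
    keep = []
    for w in ws:
        best = max(w @ mu for mu in mu_candidates)
        if w @ mu_behaviour >= best - eps:
            keep.append(w)
    return keep

# Toy example: two candidate reward weightings, two candidate policies.
ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
mu_b = np.array([1.0, 0.0])                       # behaviour policy's features
mu_candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ok = admissible_rewards(ws, mu_b, mu_candidates, eps=0.1)
```

Here only the first weighting survives: under the second, the behaviour policy earns 0 while an alternative earns 1, so that reward is inadmissible.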
Methods for Reinforcement Learning in Clinical Decision Support
A framework for clinician-in-loop decision support for critical care interventions is developed, and methods for Pareto-optimal reinforcement learning are integrated with known procedural constraints in order to consolidate multiple, often conflicting, clinical goals and produce a flexible optimized ordering policy.


Constrained Policy Improvement for Safe and Efficient Reinforcement Learning
RBI is designed to attenuate rapid policy changes for low-probability actions that were sampled less frequently, in order to avoid catastrophic performance degradation and reduce regret when learning from a batch of past experience.
Learning Safe Policies with Expert Guidance
A framework is presented for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify, together with a theoretical framework for the agent to optimize within the space of rewards consistent with its existing knowledge.
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
A sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function is proposed.
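The α-worst-case bound can be read off as a quantile over posterior reward samples. A minimal sketch (my notation, not the paper's algorithm: rewards are assumed linear in features, and a fixed reference feature expectation stands in for the per-sample optimal policy the paper actually recomputes):

```python
import numpy as np

def alpha_worst_case_bound(w_samples, mu_eval, mu_opt, alpha=0.95):
    """alpha-quantile, over posterior reward-weight samples w, of the gap
    (w . mu_opt) - (w . mu_eval) between a reference policy and the
    evaluation policy, where mu_* are discounted feature expectations."""
    gaps = w_samples @ (mu_opt - mu_eval)   # regret under each sampled reward
    return np.quantile(gaps, alpha)

# Toy posterior with three reward samples.
w_samples = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
bound = alpha_worst_case_bound(w_samples, mu_eval=np.zeros(2),
                               mu_opt=np.array([1.0, 2.0]))
```

With probability roughly α over the posterior, the evaluation policy loses no more than `bound` relative to the reference.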
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning.
Apprenticeship learning via inverse reinforcement learning
This work thinks of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and gives an algorithm for learning the task demonstrated by the expert, based on using "inverse reinforcement learning" to try to recover the unknown reward function.
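The linear-reward view makes the key quantity a feature expectation: since the return of any reward w · φ is w · μ, matching the expert's μ matches the expert's return for every such reward. A toy sketch (one-hot features and two hypothetical expert trajectories; not the paper's projection algorithm):

```python
import numpy as np

def feature_expectations(trajectories, phi, dim, gamma=0.9):
    """Empirical discounted feature expectations mu = E[sum_t gamma^t phi(s_t)]."""
    mu = np.zeros(dim)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

# States are integers; features are one-hot over 3 states.
phi = lambda s: np.eye(3)[s]
expert = [[0, 1, 2], [0, 2, 2]]
mu_expert = feature_expectations(expert, phi, dim=3)

# For any reward R(s) = w . phi(s), expected return is w @ mu, so a policy
# whose mu matches mu_expert achieves the expert's return under every such w.
w = np.array([0.0, 0.5, 1.0])
expert_return = w @ mu_expert
```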
Safe Policy Learning from Observations
A stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), is proposed that safely improves the average behavior; its primary advantages are its stability in the presence of value estimation errors and the elimination of a policy search process.
Safe Policy Improvement with Baseline Bootstrapping
This paper adopts the safe policy improvement (SPI) approach, inspired by the knows-what-it-knows paradigms, and develops two computationally efficient bootstrapping algorithms, a value-based and a policy-based, both accompanied with theoretical SPI bounds.
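The bootstrapping idea admits a short sketch: the learned policy may deviate from the baseline only on state-action pairs observed often enough in the batch. This is a simplified caricature of the policy-based variant (the count threshold `n_wedge` and tabular setup are illustrative, and the paper's full algorithm and bounds are not reproduced):

```python
import numpy as np

def spibb_greedy(Q, pi_baseline, counts, n_wedge):
    """Per-state policy: keep the baseline's probability mass on rarely
    observed actions; reassign the remaining mass greedily w.r.t. Q among
    actions seen at least n_wedge times in the batch."""
    n_states, _ = Q.shape
    pi = np.zeros_like(pi_baseline)
    for s in range(n_states):
        rare = counts[s] < n_wedge
        pi[s, rare] = pi_baseline[s, rare]        # bootstrap to the baseline
        free_mass = 1.0 - pi[s, rare].sum()
        if (~rare).any():
            best = np.argmax(np.where(~rare, Q[s], -np.inf))
            pi[s, best] += free_mass              # greedy on trusted actions
        else:
            pi[s] = pi_baseline[s]                # nothing trusted: pure baseline
    return pi
```

When every action is well observed this reduces to the greedy policy; when none is, it returns the baseline unchanged.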
High-Confidence Off-Policy Evaluation
This paper proposes an off-policy method for computing a lower confidence bound on the expected return of a policy, and quantifies the confidence in the accuracy of its estimates.
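The underlying idea can be sketched with per-trajectory importance sampling plus a simple concentration bound. Note the paper's actual inequalities are considerably tighter than the Hoeffding bound used here, and the trajectory format, clipping, and [0, r_max] return range are my assumptions:

```python
import numpy as np

def is_lower_bound(trajs, pi_e, pi_b, delta=0.05, r_max=1.0):
    """1 - delta lower confidence bound on the expected return of pi_e,
    estimated from trajectories gathered under behaviour policy pi_b.
    Assumes returns lie in [0, r_max]; each trajectory is a list of
    (state, action, reward) tuples; policies are dicts of dicts of probs."""
    estimates = []
    for traj in trajs:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            weight *= pi_e[s][a] / pi_b[s][a]     # per-step importance ratio
            ret += r
        estimates.append(min(weight * ret, r_max))  # clip to keep range bounded
    estimates = np.asarray(estimates)
    # Hoeffding bound for i.i.d. values in [0, r_max]
    return estimates.mean() - r_max * np.sqrt(
        np.log(1.0 / delta) / (2 * len(estimates)))

# Sanity check: evaluating the behaviour policy itself (weights all 1).
pi = {0: {0: 1.0}}
lb = is_lower_bound([[(0, 0, 1.0)]] * 4, pi, pi)
```

Clipping biases the estimator downward, which is harmless for a lower bound but costs tightness; that trade-off is exactly what the paper's tighter inequalities address.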
Internal Rewards Mitigate Agent Boundedness
This work extends agent design to include the meta-optimization problem of selecting internal agent goals (rewards) which optimize the designer's goals, and empirically demonstrate several instances of common agent bounds being mitigated by general internal reward functions.
Inverse Reward Design
This work introduces inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP, and introduces approximate methods for solving IRD problems, and uses their solution to plan risk-averse behavior in test MDPs.
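The risk-averse planning step can be caricatured as a maximin over posterior samples of the true reward, again assuming rewards linear in features; the candidate trajectories and posterior samples below are made up for illustration:

```python
import numpy as np

def risk_averse_choice(feature_counts, w_samples):
    """Maximin planning: among candidate trajectories (rows of feature
    counts), pick the one whose worst-case return over posterior samples
    of the true reward weights is largest."""
    returns = feature_counts @ w_samples.T        # candidates x samples
    return int(np.argmax(returns.min(axis=1)))

# Candidate 0 is great under one sampled reward but terrible under the
# other; candidate 1 hedges, so the maximin rule prefers it.
fc = np.array([[1.0, 0.0], [0.5, 0.5]])
ws = np.array([[1.0, 0.0], [0.0, 1.0]])
choice = risk_averse_choice(fc, ws)
```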