Defining admissible rewards for high-confidence policy evaluation in batch reinforcement learning

@article{Prasad2020DefiningAR,
  title={Defining admissible rewards for high-confidence policy evaluation in batch reinforcement learning},
  author={Niranjani Prasad and Barbara E. Engelhardt and Finale Doshi-Velez},
  journal={Proceedings of the ACM Conference on Health, Inference, and Learning},
  year={2020}
}
A key impediment to reinforcement learning (RL) in real applications with limited, batch data is in defining a reward function that reflects what we implicitly know about reasonable behaviour for a task and allows for robust off-policy evaluation. In this work, we develop a method to identify an admissible set of reward functions for policies that (a) do not deviate too far in performance from prior behaviour, and (b) can be evaluated with high confidence, given only a collection of past…
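As a rough illustration of the two admissibility criteria, here is a minimal sketch in Python, assuming a linear reward r(s, a) = w · φ(s, a) and per-trajectory importance weights; the function name, thresholds, and specific tests (a feature-return gap and an effective-sample-size check) are illustrative, not the paper's formulation:

```python
import numpy as np

def is_admissible(w, phi_behaviour, phi_eval, is_weights,
                  max_gap=0.1, min_ess=100.0):
    """Toy admissibility check for a linear reward r(s, a) = w . phi(s, a).

    (a) Consistency: under w, the evaluation policy's expected feature
        return should not fall too far below the behaviour policy's.
    (b) Evaluability: the effective sample size (ESS) of the importance
        weights should be large enough for a tight off-policy estimate.

    Names, thresholds, and both tests are illustrative assumptions.
    """
    gap = w @ (phi_behaviour - phi_eval)                    # (a) performance gap
    ess = is_weights.sum() ** 2 / (is_weights ** 2).sum()   # (b) effective sample size
    return gap <= max_gap and ess >= min_ess
```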
Methods for Reinforcement Learning in Clinical Decision Support
A framework for clinician-in-the-loop decision support for critical care interventions is developed; methods for Pareto-optimal reinforcement learning are integrated with known procedural constraints to consolidate multiple, often conflicting, clinical goals and produce a flexible optimized ordering policy.
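A minimal sketch of how conflicting objectives might be scalarized under procedural constraints; the function and array layout are hypothetical, not the thesis's actual method:

```python
import numpy as np

def constrained_scalarized_action(q_values, weights, allowed):
    """Pick the action maximizing a weighted sum of per-objective Q-values,
    restricted to actions permitted by known procedural constraints.

    q_values: (n_actions, n_objectives) per-objective value estimates.
    weights:  (n_objectives,) trade-off weights (one point on the Pareto front).
    allowed:  (n_actions,) boolean mask from the procedural constraints.
    """
    scores = q_values @ weights
    scores = np.where(allowed, scores, -np.inf)  # rule out disallowed actions
    return int(np.argmax(scores))
```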

References

Showing 1-10 of 54 references.
Constrained Policy Improvement for Safe and Efficient Reinforcement Learning
RBI is designed to attenuate rapid policy changes for low-probability actions that were sampled less frequently, in order to avoid catastrophic performance degradation and reduce regret when learning from a batch of past experience.
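A simplified stand-in for the intuition: a capped policy update that leaves rarely sampled actions untouched (names and thresholds are illustrative, not RBI's actual reroute rule):

```python
import numpy as np

def capped_improvement(pi_b, advantages, counts, max_ratio=2.0, n_min=20):
    """Shift probability toward advantageous actions, but freeze actions
    that were rarely sampled in the batch, whose advantage estimates are
    least reliable. A simplified stand-in, not the paper's exact update."""
    boost = np.where(advantages > 0, max_ratio, 1.0)
    boost = np.where(counts < n_min, 1.0, boost)  # freeze rarely sampled actions
    pi_new = pi_b * boost
    return pi_new / pi_new.sum()                  # renormalize to a distribution
```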
Learning Safe Policies with Expert Guidance
A framework is presented for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify, together with a theoretical account of how the agent can optimize over the space of rewards consistent with its existing knowledge.
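One way to make "optimize in the space of consistent rewards" concrete is a maximin score over a finite reward set; the sketch below assumes linear rewards and precomputed per-policy feature returns, and is illustrative rather than the paper's algorithm:

```python
import numpy as np

def maximin_policy(policy_feature_returns, consistent_rewards):
    """Score each policy by its worst-case return across rewards consistent
    with prior knowledge, then keep the policy with the best worst case.

    policy_feature_returns: (n_policies, d) expected feature returns.
    consistent_rewards:     (n_rewards, d) reward weight vectors.
    """
    returns = policy_feature_returns @ consistent_rewards.T
    worst_case = returns.min(axis=1)  # worst reward for each policy
    return int(np.argmax(worst_case))
```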
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
A sampling method based on Bayesian inverse reinforcement learning is proposed that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function.
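A sample-based sketch of such a bound, assuming returns have already been computed under each posterior reward draw (the inputs and the plain quantile rule are illustrative):

```python
import numpy as np

def alpha_worst_case_bound(eval_returns, optimal_returns, alpha=0.95):
    """Under each reward sampled from a Bayesian IRL posterior, take the
    return of the evaluation policy and of the corresponding optimal
    policy; the alpha-quantile of the gaps bounds the loss with
    probability roughly alpha. One array entry per posterior sample."""
    gaps = np.asarray(optimal_returns) - np.asarray(eval_returns)
    return float(np.quantile(gaps, alpha))
```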
High Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning.
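A simplified stand-in for this kind of safety test, using a one-sided Student-t lower bound on per-trajectory importance-sampled returns (the paper's procedure differs in detail):

```python
import numpy as np
from scipy import stats

def passes_safety_test(is_returns, baseline_return, delta=0.05):
    """Propose a candidate policy only if a one-sided Student-t lower bound
    on its mean importance-sampled return beats the baseline; otherwise
    fall back ('no solution found'). Simplified illustration."""
    x = np.asarray(is_returns, dtype=float)
    n = len(x)
    lower = x.mean() - x.std(ddof=1) / np.sqrt(n) * stats.t.ppf(1 - delta, n - 1)
    return lower >= baseline_return
```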
Apprenticeship learning via inverse reinforcement learning
This work models the expert as maximizing a reward function expressible as a linear combination of known features, and gives an algorithm for learning the demonstrated task that uses inverse reinforcement learning to recover the unknown reward function.
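A sketch of one step of this scheme, assuming expert and policy feature expectations are available (simplified from the paper's max-margin and projection variants):

```python
import numpy as np

def irl_weight_step(mu_expert, mu_policy):
    """With rewards assumed linear in known features, choose the unit-norm
    weight vector that most separates the expert's expected feature counts
    from the current policy's, so the expert looks maximally better under
    the inferred reward. A simplified illustration."""
    w = np.asarray(mu_expert) - np.asarray(mu_policy)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w
```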
Safe Policy Learning from Observations
A stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), safely improves on the average behavior; its primary advantages are stability in the presence of value-estimation errors and the elimination of a policy-search process.
Safe Policy Improvement with Baseline Bootstrapping
This paper adopts the safe policy improvement (SPI) approach, inspired by the knows-what-it-knows paradigm, and develops two computationally efficient bootstrapping algorithms, one value-based and one policy-based, both accompanied by theoretical SPI bounds.
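A sketch of the value-based idea at a single state, assuming per-action visit counts are available (simplified from the paper's Pi_b-SPIBB algorithm; names are illustrative):

```python
import numpy as np

def spibb_policy_at_state(pi_b, q_hat, counts, n_min=10):
    """Keep the baseline's probabilities on actions observed fewer than
    n_min times (the 'bootstrapped' set), and move the remaining
    probability mass greedily onto the best well-observed action."""
    pi_b = np.asarray(pi_b, dtype=float)
    bootstrapped = np.asarray(counts) < n_min
    pi = np.where(bootstrapped, pi_b, 0.0)  # keep baseline mass on rare actions
    if (~bootstrapped).any():
        best = np.argmax(np.where(~bootstrapped, q_hat, -np.inf))
        pi[best] += 1.0 - pi.sum()          # freed mass goes to the best action
    return pi
```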
High-Confidence Off-Policy Evaluation
This paper proposes an off-policy method for computing a lower confidence bound on the expected return of a policy, providing confidence in the accuracy of the resulting estimates.
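A sketch of the simplest valid bound of this kind, using truncated importance weights and a Hoeffding term (the paper's concentration inequality is tighter):

```python
import numpy as np

def hcope_lower_bound(returns, is_weights, delta=0.05, c=10.0):
    """Truncate the per-trajectory importance weights at c, form the
    importance-sampled returns (which then lie in [0, c] for returns
    scaled to [0, 1]), and subtract a Hoeffding term to get a 1 - delta
    lower bound on the evaluation policy's expected return."""
    x = np.minimum(np.asarray(is_weights), c) * np.asarray(returns)
    n = len(x)
    return x.mean() - c * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```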
Internal Rewards Mitigate Agent Boundedness
This work extends agent design to include the meta-optimization problem of selecting internal agent goals (rewards) that optimize the designer's goals, and empirically demonstrates several instances of common agent bounds being mitigated by general internal reward functions.
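In the simplest case the meta-optimization reduces to a search over candidate internal rewards; the callables below are hypothetical stand-ins for an RL training loop and a designer-side evaluation:

```python
def best_internal_reward(candidate_rewards, train_agent, designer_return):
    """Train a bounded agent under each candidate internal reward and keep
    whichever maximizes the designer's true objective. train_agent and
    designer_return are hypothetical callables."""
    return max(candidate_rewards,
               key=lambda r: designer_return(train_agent(r)))
```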
Inverse Reward Design
This work introduces inverse reward design (IRD) as the problem of inferring the true objective from the designed reward and the training MDP; it develops approximate methods for solving IRD problems and uses their solutions to plan risk-averse behavior in test MDPs.
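A sketch of the risk-averse planning step, assuming trajectories are summarized by feature counts and the reward posterior by samples (illustrative, not the paper's exact method):

```python
import numpy as np

def risk_averse_trajectory(traj_features, posterior_rewards):
    """Score each candidate trajectory (via its feature counts) under
    samples from the posterior over true rewards, and pick the trajectory
    with the best worst-case score.

    traj_features:     (n_traj, d) feature counts per trajectory.
    posterior_rewards: (n_samples, d) sampled reward weight vectors.
    """
    scores = np.asarray(traj_features) @ np.asarray(posterior_rewards).T
    return int(np.argmax(scores.min(axis=1)))
```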