High-Confidence Off-Policy Evaluation

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not… 
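The core idea of off-policy evaluation can be sketched with ordinary importance sampling: reweight each logged trajectory's return by the likelihood ratio between the evaluation policy and the behavior policy. This is a minimal illustration, not the paper's exact estimator; the function names and the `(state, action, reward)` trajectory shape are assumptions for the sketch:

```python
def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of the evaluation policy's return.

    trajectories: list of [(state, action, reward), ...] generated by pi_b.
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation /
    behavior policy. Requires pi_b(a, s) > 0 wherever pi_e(a, s) > 0.
    """
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)  # cumulative likelihood ratio
            ret += (gamma ** t) * r            # discounted trajectory return
        total += weight * ret                  # reweight the full return
    return total / len(trajectories)
```

When `pi_e` equals `pi_b` every weight is 1 and the estimate reduces to the on-policy Monte Carlo average; the further the two policies diverge, the larger the variance of the weights, which is exactly why confidence bounds on such estimates matter.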


Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

A new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way of mixing model-based estimates with importance-sampling estimates.

High Confidence Off-Policy Evaluation with Models

This work proposes two bootstrapping approaches combined with learned MDP transition models in order to efficiently estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces and derives a theoretical upper bound on model bias.

On- and Off-Policy Monotonic Policy Improvement

This paper shows that the monotonic policy improvement is guaranteed from on- and off-policy mixture samples, and provides a trust region policy optimization method using experience replay as a naive application of the proposed bound.

High Confidence Policy Improvement

We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require…

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
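The bandit-case doubly robust estimator that this work extends to sequential problems combines a learned reward model with an importance-weighted correction: the model supplies a low-variance baseline and the correction keeps the estimate unbiased when the behavior probabilities are known. A minimal sketch under assumed names and signatures (not the paper's sequential estimator):

```python
def dr_bandit_estimate(samples, pi_e, pi_b, q_hat, actions):
    """Doubly robust value estimate for a contextual bandit.

    samples: list of (state, action, reward) logged under pi_b.
    pi_e(a, s), pi_b(a, s): evaluation / behavior action probabilities.
    q_hat(s, a): learned reward model over a finite action set `actions`.
    """
    total = 0.0
    for s, a, r in samples:
        # Model-based baseline: expected reward of pi_e under q_hat.
        v_hat = sum(pi_e(b, s) * q_hat(s, b) for b in actions)
        rho = pi_e(a, s) / pi_b(a, s)            # importance ratio
        total += v_hat + rho * (r - q_hat(s, a))  # baseline + IS correction
    return total / len(samples)
```

If `q_hat` is exact the correction term vanishes and the estimate has the model's low variance; if `q_hat` is wrong the importance-weighted residual still removes the model's bias, hence "best of both worlds."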

A Configurable off-Policy Evaluation with Key State-Based Bias Constraints in AI Reinforcement Learning

This paper develops a configurable OPE method with key-state-based bias constraints: it adopts FP-Growth to mine the key states and obtain their corresponding reward expectations, then configures each reward-expectation scope as a bias constraint to construct a new objective function combining bias and variance.

Off-policy Model-based Learning under Unknown Factored Dynamics

The G-SCOPE algorithm is introduced that evaluates a new policy based on data generated by the existing policy and is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment.

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

Two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data are proposed.
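The bootstrapping idea behind such lower confidence bounds can be shown generically: resample the observed returns with replacement, recompute the mean on each resample, and take a low quantile of the resulting distribution. This sketch omits the cited methods' learned MDP transition models (which generate the returns being resampled) and uses assumed names:

```python
import random

def bootstrap_lower_bound(returns, delta=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on the mean return.

    returns: list of scalar policy returns; delta: allowed failure rate.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample with replacement and record the resample mean.
        resample = [rng.choice(returns) for _ in returns]
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(delta * n_boot)]  # delta-quantile of bootstrap means
```

Bootstrap bounds are approximate rather than guaranteed, which is the usual trade-off against concentration-inequality bounds: far tighter with limited data, but without a finite-sample coverage proof.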

Accountable Off-Policy Evaluation With Kernel Bellman Statistics

A new variational framework is proposed which reduces the problem of calculating tight confidence bounds in OPE into an optimization problem on a feasible set that catches the true state-action value function with high probability.

Marginalized Off-Policy Evaluation for Reinforcement Learning

A marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step; it is likely the first OPE estimator with provably optimal dependence on the horizon H and the second moments of the importance weights.



Eligibility Traces for Off-Policy Policy Evaluation

This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.

Regularized Off-Policy TD-Learning

A novel l1-regularized off-policy convergent TD-learning method, which is able to learn sparse representations of value functions with low computational complexity, is presented.

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Introduction to Reinforcement Learning

In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.

Offline policy evaluation across representations with applications to educational games

A data-driven methodology for comparing and validating policies offline, which focuses on the ability of each policy to generalize to new data and applies to a partially-observable, high-dimensional concept sequencing problem in an educational game.

Planning treatment of ischemic heart disease with partially observable Markov decision processes

GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces

The GQ(λ) algorithm is introduced, which can be seen as an extension of earlier gradient temporal-difference work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions.

Application of the Actor-Critic Architecture to Functional Electrical Stimulation Control of a Human Arm

This paper studies the application of the actor-critic architecture, with neural networks for both the actor and the critic, as a controller that can adapt to the changing dynamics of a human arm.


Abstract: Consider a random variable X with a continuous cumulative distribution function F(x) such that F(a) = 0 and F(b) = 1 for known finite numbers a and b (a < b). The distribution function…

Empirical Bernstein Bounds and Sample-Variance Penalization

Improved constants for data dependent and variance sensitive confidence bounds are given, called empirical Bernstein bounds, and extended to hold uniformly over classes of functions whose growth function is polynomial in the sample size n, and sample variance penalization is considered.
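The empirical Bernstein bound replaces the range-based term of Hoeffding's inequality with one driven by the sample variance, which is what makes it useful for low-variance importance-weighted returns. A sketch of the Maurer-Pontil form for samples in [0, 1] (the constants below follow their published bound; treat the exact form as an assumption of this sketch):

```python
import math

def empirical_bernstein_bound(xs, delta):
    """Maurer-Pontil empirical Bernstein upper deviation bound for X in [0, 1].

    With probability at least 1 - delta,
        E[X] <= mean + sqrt(2 * var * ln(2/delta) / n)
                + 7 * ln(2/delta) / (3 * (n - 1)),
    where var is the unbiased sample variance.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    log_term = math.log(2.0 / delta)
    return mean + math.sqrt(2.0 * var * log_term / n) \
           + 7.0 * log_term / (3.0 * (n - 1))
```

For nearly constant data the variance term vanishes and the bound tightens at rate O(1/n) rather than the O(1/sqrt(n)) of range-based bounds, which is the variance sensitivity the abstract refers to.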