Corpus ID: 231698338

High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Yash Chandak, Shiv Shankar, P. S. Thomas
Many sequential decision-making systems leverage data collected under prior policies to propose a new policy. For critical applications, it is important to provide high-confidence guarantees on the new policy's behavior before deployment, to ensure that the policy will behave as desired. Prior works have studied high-confidence off-policy estimation of the expected return; however, high-confidence off-policy estimation of the variance of returns can be equally critical for high-risk…
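The quantity the abstract targets can be illustrated with a plug-in importance-sampling (IS) estimator: weight each observed return by the ratio of target to behavior policy probabilities, estimate the first and second moments of the return under the target policy, and take their difference. This is a minimal sketch on a synthetic one-step (bandit-style) problem, not the paper's estimator; all policy and reward values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two actions, known behavior and target policies.
P_BEHAVIOR = np.array([0.5, 0.5])   # pi_b(a), policy that generated the data
P_TARGET = np.array([0.2, 0.8])     # pi_e(a), policy we want to evaluate
REWARD_MEANS = np.array([0.0, 1.0])

n = 100_000
actions = rng.choice(2, size=n, p=P_BEHAVIOR)
returns = rng.normal(REWARD_MEANS[actions], 0.5)

# Importance weights correct for the mismatch between the two policies.
rho = P_TARGET[actions] / P_BEHAVIOR[actions]

# Plug-in off-policy estimates of the first and second moments of the
# return under the target policy; the variance follows from the two.
mean_hat = float(np.mean(rho * returns))
second_hat = float(np.mean(rho * returns ** 2))
var_hat = second_hat - mean_hat ** 2

print(mean_hat, var_hat)
```

For this setup the true target-policy mean is 0.8 and the true variance is 0.41, which the plug-in estimates recover up to sampling noise; a high-confidence version would additionally wrap `var_hat` in a concentration bound.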


Universal Off-Policy Evaluation
This paper takes the first steps towards a universal off-policy estimator (UnO)—one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution—and discusses UnO's applicability in various settings.
Off-Policy Risk Assessment in Contextual Bandits
This paper proposes Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy’s CDF and then generates plugin estimates for any collection of Lipschitz risks, providing finite sample guarantees that hold simultaneously over the entire class.


Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation
This work identifies conditions under which statistical bootstrapping in this setting is guaranteed to yield correct confidence intervals, and evaluates the proposed method, showing that it yields accurate confidence intervals in a variety of conditions.
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
This work proposes using policies over temporally extended actions, called options, and shows that combining these policies with importance sampling can significantly improve performance for long-horizon problems, and derives a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
A new off-policy estimation method is proposed that applies importance sampling directly to the stationary state-visitation distributions, avoiding the exploding-variance issue faced by existing estimators.
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
A new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way to mix between model-based estimates and importance-sampling-based estimates.
High-Confidence Off-Policy Evaluation
This paper proposes an off-policy method for computing a lower confidence bound on the expected return of a policy, and provides confidence in the accuracy of its estimates.
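A common recipe for such lower confidence bounds is to apply a concentration inequality to clipped importance-weighted returns. The sketch below uses Hoeffding's inequality, which requires only boundedness; it illustrates the general idea rather than the specific (tighter) bounds developed in the paper, and the policies and rewards are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def hoeffding_lower_bound(values, b, delta):
    """(1 - delta)-confidence lower bound on the mean of values in [0, b]."""
    clipped = np.clip(values, 0.0, b)
    n = len(values)
    return float(np.mean(clipped) - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))

# Synthetic bandit data: uniform behavior policy, target prefers action 1;
# rewards lie in [0, 1], so importance-weighted returns are bounded by the
# largest probability ratio.
p_b, p_e = np.array([0.5, 0.5]), np.array([0.2, 0.8])
a = rng.choice(2, size=50_000, p=p_b)
r = rng.uniform(0.0, 1.0, size=a.size) * a      # reward is 0 for action 0
w = (p_e[a] / p_b[a]) * r                       # importance-weighted returns

lower = hoeffding_lower_bound(w, b=float(p_e.max() / p_b.min()), delta=0.05)
print(lower)
```

With 50,000 samples the bound sits just below the true target-policy value of 0.4; clipping trades a little bias for a much tighter bound when the weights are heavy-tailed.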
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
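The bandit version of the doubly robust (DR) estimator that this work extends combines a model-based baseline with an importance-weighted correction for the model's residual. A minimal sketch, with a deliberately misspecified reward model to show that the correction restores unbiasedness; all numbers are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

p_b = np.array([0.5, 0.5])          # behavior policy
p_e = np.array([0.1, 0.9])          # target (evaluation) policy
true_q = np.array([0.2, 1.0])       # true mean reward per action
q_hat = np.array([0.0, 0.5])        # biased learned reward model

n = 200_000
a = rng.choice(2, size=n, p=p_b)
r = true_q[a] + rng.normal(0.0, 0.3, size=n)
rho = p_e[a] / p_b[a]

# DR estimate: model-based baseline plus an importance-weighted correction
# on the model's residual, which cancels the model's bias in expectation.
baseline = float(np.dot(p_e, q_hat))            # E_{a ~ pi_e}[q_hat(a)]
dr = baseline + float(np.mean(rho * (r - q_hat[a])))
print(dr)
```

Here the true target-policy value is 0.92; the baseline alone gives 0.45, yet the DR estimate recovers 0.92 up to sampling noise, and its variance shrinks as `q_hat` improves.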
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
Two bootstrapping off-policy evaluation methods are proposed that use learned MDP transition models to estimate lower confidence bounds on policy performance with limited data, in both continuous and discrete state spaces.
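The basic percentile-bootstrap recipe underlying such methods is short: resample the per-trajectory estimates with replacement, recompute the point estimate on each resample, and read a lower bound off the empirical quantiles. The sketch below applies it to plain importance-sampling estimates rather than the model-based procedure above; the setup is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic per-sample importance-sampling estimates of a policy's value.
p_b, p_e = np.array([0.5, 0.5]), np.array([0.3, 0.7])
a = rng.choice(2, size=5_000, p=p_b)
r = a + rng.normal(0.0, 0.2, size=a.size)   # mean reward equals action index
x = (p_e[a] / p_b[a]) * r                   # per-sample IS estimates

# Percentile bootstrap: resample, recompute the mean, take a quantile.
boot = np.array([
    np.mean(rng.choice(x, size=x.size, replace=True))
    for _ in range(2_000)
])
lower_95 = float(np.percentile(boot, 5))    # one-sided 95% lower bound
print(lower_95)
```

Unlike the Hoeffding-style bounds, the bootstrap bound is only approximately valid, which is why identifying conditions for its correctness (as the work above does) matters.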
Policy Gradients with Variance Related Risk Criteria
A framework for local policy gradient style reinforcement learning algorithms for criteria that involve both the expected cost and the variance of the cost.
Actor-Critic Algorithms for Risk-Sensitive MDPs
This paper considers both discounted and average-reward Markov decision processes and devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, establishing the convergence of the algorithms to locally risk-sensitive optimal policies.
Eligibility Traces for Off-Policy Policy Evaluation
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares it with four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
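The importance-sampling connection this line of work builds on can be seen by contrasting ordinary IS, which weights the whole return by the full-trajectory probability ratio, with per-decision IS, which weights each reward only by the ratios of the actions taken up to that step. A toy two-step sketch (setup invented for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

H = 2                                   # horizon
p_b = np.array([0.5, 0.5])              # behavior policy, per step
p_e = np.array([0.4, 0.6])              # target policy, per step
n = 100_000
a = rng.choice(2, size=(n, H), p=p_b)   # actions at each step
r = a.astype(float)                     # toy model: reward = action taken

step_rho = p_e[a] / p_b[a]              # per-step probability ratios
cum_rho = np.cumprod(step_rho, axis=1)  # products of ratios up to each step

# Ordinary IS: full-trajectory weight times the total return.
ordinary = float(np.mean(cum_rho[:, -1] * r.sum(axis=1)))
# Per-decision IS: each reward weighted only by the ratios preceding it.
per_decision = float(np.mean((cum_rho * r).sum(axis=1)))
print(ordinary, per_decision)
```

Both estimators are unbiased for the target-policy value (1.2 here), but the per-decision form discards the irrelevant future ratios on each reward, which typically lowers its variance, especially over long horizons.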