Counterfactual Learning with General Data-generating Policies

Yusuke Narita, Kyohei Okumura, Akihiro Shimizu and Kohei Yata
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full-support and deficient-support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges…
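For context, OPE under a full-support stochastic logging policy is commonly done with inverse propensity scoring (IPS); a minimal sketch, with purely illustrative data and names:

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Inverse propensity scoring (IPS): reweight each logged reward by the
    ratio of target- to logging-policy probabilities for the action that was
    actually taken, then average."""
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

# Toy log of 4 rounds (illustrative numbers only).
rewards = np.array([1.0, 0.0, 1.0, 1.0])
logging_probs = np.array([0.5, 0.5, 0.25, 0.25])  # P(logged action | logging policy)
target_probs = np.array([0.9, 0.1, 0.5, 0.5])     # P(same action | target policy)
print(ips_estimate(rewards, logging_probs, target_probs))
```

Note that IPS requires every action the target policy can take to have positive logging probability; the deficient-support case the paper addresses is exactly where this assumption fails.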

Off-policy Bandits with Deficient Support

This work systematically analyzed the statistical and computational properties of three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space.

Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting

A new method is proposed that computes a lower bound, at a desired coverage level, on the value of an arbitrary target policy given logged contextual-bandit data; it is built around the so-called self-normalized importance weighting (SN) estimator.
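The SN estimator the summary refers to normalizes by the sum of importance weights rather than the sample size; a minimal sketch with placeholder data:

```python
import numpy as np

def snips_estimate(rewards, logging_probs, target_probs):
    """Self-normalized importance weighting (SN / SNIPS): divide by the sum
    of importance weights instead of the sample size, trading a small bias
    for substantially lower variance than plain IPS."""
    weights = target_probs / logging_probs
    return np.sum(weights * rewards) / np.sum(weights)

# Illustrative log of 4 rounds.
rewards = np.array([1.0, 0.0, 1.0, 1.0])
logging_probs = np.array([0.5, 0.5, 0.25, 0.25])
target_probs = np.array([0.9, 0.1, 0.5, 0.5])
print(snips_estimate(rewards, logging_probs, target_probs))
```

Because the weights sum to roughly n in expectation, the estimate stays bounded by the reward range even when individual weights are large.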

The Self-Normalized Estimator for Counterfactual Learning

This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator…

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

An easily computable confidence bound for the policy evaluator is provided, which may be useful for optimistic planning and safe policy improvement, and establishes a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound.

Algorithm is Experiment: Machine Learning, Market Design, and Policy Eligibility Rules

This work develops a treatment-effect estimator for a class of stochastic and deterministic decision-making algorithms and applies it to evaluate the effect of the Coronavirus Aid, Relief, and Economic Security (CARES) Act.

Learning from Logged Implicit Exploration Data

We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned.

Eligibility Traces for Off-Policy Policy Evaluation

This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known (a Monte Carlo method); it analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.

Doubly Robust Policy Evaluation and Optimization

It is proved that the doubly robust estimation method uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice in policy evaluation and optimization.
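The doubly robust estimator combines a direct reward model with an IPS correction on the model's residual; a minimal sketch, where the reward model values and log data are placeholders rather than anything from the paper:

```python
import numpy as np

def dr_estimate(rewards, logging_probs, target_probs, q_logged, v_target):
    """Doubly robust (DR) OPE: a direct-method baseline (v_target) plus an
    importance-weighted correction on the reward model's residual. The
    estimate is consistent if either the reward model or the logged
    propensities are correct."""
    weights = target_probs / logging_probs
    return np.mean(v_target + weights * (rewards - q_logged))

# Illustrative two-round log with a placeholder reward model.
rewards = np.array([1.0, 0.0])
logging_probs = np.array([0.5, 0.5])
target_probs = np.array([1.0, 0.0])
q_logged = np.array([0.75, 0.25])  # model's predicted reward for the logged action
v_target = np.array([0.75, 0.75])  # model's predicted value of the target policy
print(dr_estimate(rewards, logging_probs, target_probs, q_logged, v_target))  # 1.0
```

When the reward model is accurate, the residuals shrink toward zero and the variance contributed by the importance weights largely disappears, which is the source of the variance improvement the summary describes.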

Doubly robust off-policy evaluation with shrinkage

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the…

More Robust Doubly Robust Off-policy Evaluation

This paper proposes alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameter by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal.