Counterfactual Learning with General Data-generating Policies
@article{Narita2022CounterfactualLW,
  title   = {Counterfactual Learning with General Data-generating Policies},
  author  = {Yusuke Narita and Kyohei Okumura and Akihiro Shimizu and Kohei Yata},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2212.01925}
}
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method’s prediction converges…
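For orientation, below is a minimal sketch of the standard inverse propensity score (IPS) estimator that full-support OPE builds on; the function and variable names (`ips_value`, `logging_propensity`, `target_policy`) are illustrative assumptions, not the paper's method, and plain IPS is exactly what breaks down under the deficient-support logging policies the paper targets.

```python
# Illustrative IPS sketch for contextual-bandit OPE (names are hypothetical).
import numpy as np

def ips_value(contexts, actions, rewards, logging_propensity, target_policy):
    """Estimate the value of `target_policy` from logs collected by a different policy.

    logging_propensity[i] is P(logged action | context) under the logging policy;
    target_policy(x, a) returns the target policy's probability of action a in context x.
    """
    target_prob = np.array([target_policy(x, a) for x, a in zip(contexts, actions)])
    # Importance weights are undefined (or explode) when the logging policy gives
    # an action zero probability -- the "deficient support" case the paper handles.
    weights = target_prob / np.asarray(logging_propensity, dtype=float)
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))
```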
References
Off-policy Bandits with Deficient Support
- Computer Science, KDD
- 2020
This work systematically analyzed the statistical and computational properties of three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space.
Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting
- Computer Science, AISTATS
- 2021
Proposes a new method, built around the so-called Self-normalized Importance Weighting (SN) estimator, to compute a lower bound on the value of an arbitrary target policy at a desired coverage level, given logged contextual-bandit data.
The Self-Normalized Estimator for Counterfactual Learning
- Computer Science, NIPS
- 2015
This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that…
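For context, here is a minimal sketch of the self-normalized (SN / SNIPS) estimator that the two entries above study; the argument names are hypothetical placeholders for logged data under the usual propensity-logging assumption.

```python
# Illustrative self-normalized importance weighting (SNIPS) sketch.
import numpy as np

def snips_value(rewards, target_prob, logging_prob):
    """Self-normalized importance-weighted estimate of the target policy's value."""
    weights = np.asarray(target_prob, dtype=float) / np.asarray(logging_prob, dtype=float)
    # Dividing by the sum of weights (rather than by n) trades a small bias for a
    # typically large variance reduction compared with vanilla IPS.
    return float(np.sum(weights * np.asarray(rewards, dtype=float)) / np.sum(weights))
```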
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
- Computer Science, ICML
- 2020
An easily computable confidence bound for the policy evaluator is provided, which may be useful for optimistic planning and safe policy improvement, and establishes a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound.
Algorithm is Experiment: Machine Learning, Market Design, and Policy Eligibility Rules
- Economics, ArXiv
- 2021
This work develops a treatment-effect estimator for a class of stochastic and deterministic decision-making algorithms and applies it to evaluate the effect of the Coronavirus Aid, Relief, and Economic Security (CARES) Act.
Learning from Logged Implicit Exploration Data
- Computer Science, NIPS
- 2010
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned. The…
Eligibility Traces for Off-Policy Policy Evaluation
- Computer Science, ICML
- 2000
This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
Doubly Robust Policy Evaluation and Optimization
- Computer Science, ArXiv
- 2015
It is proved that the doubly robust estimation method uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies, and is expected to become common practice in policy evaluation and optimization.
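As a rough illustration of the doubly robust idea, the sketch below combines a model-based (direct-method) term with an importance-weighted correction; the reward-model interface `q_hat(context, action)` and the other names are assumptions made for this example, not the paper's exact estimator.

```python
# Illustrative doubly robust (DR) estimator for contextual bandits.
import numpy as np

def dr_value(contexts, actions, rewards, logging_prob, target_policy, action_set, q_hat):
    """DR estimate: model-based baseline plus an importance-weighted correction."""
    values = []
    for x, a, r, p0 in zip(contexts, actions, rewards, logging_prob):
        # Direct-method term: expected model reward under the target policy.
        dm = sum(target_policy(x, b) * q_hat(x, b) for b in action_set)
        # Correction term: the estimate is unbiased if either the logged
        # propensities or the reward model q_hat is correct.
        w = target_policy(x, a) / p0
        values.append(dm + w * (r - q_hat(x, a)))
    return float(np.mean(values))
```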
Doubly robust off-policy evaluation with shrinkage
- Computer Science, Mathematics, ICML
- 2020
We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the…
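The general idea can be sketched as applying a shrinkage map to the raw importance weights inside a DR estimator. The particular map used below, `lam * w / (w**2 + lam)`, is one simple pessimistic-shrinkage choice rather than the paper's tuned estimator, and all argument names are assumptions of this sketch.

```python
# Illustrative DR-with-weight-shrinkage sketch.
import numpy as np

def dr_shrinkage_value(rewards, weights, dm_values, q_logged, lam):
    """DR estimate with shrunken importance weights.

    weights:   raw weights w_i = pi(a_i | x_i) / pi0(a_i | x_i)
    dm_values: per-sample direct-method term under the target policy
    q_logged:  reward-model prediction q_hat(x_i, a_i) for the logged action
    lam:       shrinkage strength; lam -> infinity recovers plain DR
    """
    weights = np.asarray(weights, dtype=float)
    shrunk = lam * weights / (weights ** 2 + lam)  # dampens very large weights
    correction = shrunk * (np.asarray(rewards, dtype=float) - np.asarray(q_logged, dtype=float))
    return float(np.mean(np.asarray(dm_values, dtype=float) + correction))
```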
More Robust Doubly Robust Off-policy Evaluation
- Computer Science, ICML
- 2018
This paper proposes alternative DR estimators, called more robust doubly robust (MRDR), that learn the reward-model parameters by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal. A conceptual sketch of this variance-minimization idea follows below.
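The sketch assumes a linear reward model q_theta(x, a) = theta @ phi(x, a) and chooses theta to minimize the empirical variance of the per-sample DR terms; the feature names and the use of scipy.optimize are assumptions of this example, not the MRDR paper's exact procedure.

```python
# Conceptual MRDR-style fit: minimize the sample variance of the DR terms.
import numpy as np
from scipy.optimize import minimize

def mrdr_fit(rewards, weights, phi_logged, phi_bar):
    """Fit a linear reward model by minimizing the variance of the DR estimator.

    rewards, weights: arrays of shape (n,)
    phi_logged: (n, d) features of the logged (context, action) pairs
    phi_bar:    (n, d) target-policy-averaged features, sum_b pi(b | x) * phi(x, b)
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)

    def dr_variance(theta):
        # Per-sample DR term: direct-method value plus weighted residual correction.
        dr_terms = phi_bar @ theta + weights * (rewards - phi_logged @ theta)
        return np.var(dr_terms)

    d = phi_logged.shape[1]
    return minimize(dr_variance, x0=np.zeros(d)).x
```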