Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales
@article{Kato2020ConfidenceIF,
  title={Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales},
  author={Masahiro Kato},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.06982}
}
This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data generated by the behavior policies of the bandit algorithm. Because the bandit algorithm updates its policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the…
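To make the setting concrete, the following is a minimal sketch (not the paper's standardized-martingale construction): an epsilon-greedy behavior policy updates itself from past rewards, so the logged samples are dependent, and a plain inverse probability weighting (IPW) estimator is then applied to those logs. The names `run_bandit` and `ipw_estimate` and the Bernoulli reward model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                  # number of arms
true_means = np.array([0.3, 0.5, 0.7])  # assumed Bernoulli reward means
T = 5000
eps = 0.1

def run_bandit():
    """Collect logs with an epsilon-greedy behavior policy (adaptive, hence non-i.i.d.)."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    logs = []                          # tuples of (action, reward, behavior propensity)
    for t in range(T):
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        greedy = int(np.argmax(means))
        probs = np.full(K, eps / K)
        probs[greedy] += 1.0 - eps     # behavior policy at round t depends on the history
        a = rng.choice(K, p=probs)
        r = rng.binomial(1, true_means[a])
        counts[a] += 1
        sums[a] += r
        logs.append((a, r, probs[a]))
    return logs

def ipw_estimate(logs, target_probs):
    """Plain IPW estimate of the evaluation policy's value from the logged data."""
    return float(np.mean([target_probs[a] / p * r for a, r, p in logs]))

target = np.array([0.0, 0.0, 1.0])     # evaluation policy: always pull arm 2
logs = run_bandit()
print("IPW estimate:", ipw_estimate(logs, target), "true value:", true_means[2])
```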
2 Citations
Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples
- Mathematics, ArXiv
- 2020
This work theoretically and experimentally compares estimators for off-policy evaluation (OPE) using dependent samples obtained via multi-armed bandit (MAB) algorithms, focusing on a doubly robust (DR) estimator that consists of an inverse probability weighting (IPW) component and an estimator of the conditional expected outcome.
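A minimal sketch of the DR form this entry describes (a model-based term plus an IPW-weighted residual). The log format (action, reward, behavior propensity) and the per-action outcome estimate `q_hat` are assumptions of this sketch, not the cited paper's notation.

```python
import numpy as np

def dr_estimate(logs, pi_e, q_hat):
    """Doubly robust value estimate: outcome-model term plus IPW-weighted residual."""
    values = []
    for a, r, b_prob in logs:                          # (action, reward, behavior propensity)
        direct = float(np.dot(pi_e, q_hat))            # outcome model averaged under pi_e
        correction = pi_e[a] / b_prob * (r - q_hat[a])  # IPW component applied to the residual
        values.append(direct + correction)
    return float(np.mean(values))
```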
A Practical Guide of Off-Policy Evaluation for Bandit Problems
- Computer Science, ArXiv
- 2020
This paper proposes a meta-algorithm built on existing OPE estimators for this setting and investigates the proposed concepts in experiments on synthetic and open real-world datasets.
References
SHOWING 1-10 OF 42 REFERENCES
Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
- Economics, NeurIPS
- 2020
The efficiency bound of OPE under a covariate shift is derived, and doubly robust and efficient estimators for OPE and OPL are proposed using an estimator of the density ratio between the distributions of the historical and evaluation data.
More Efficient Off-Policy Evaluation through Regularized Targeted Learning
- Economics, Computer Science, ICML
- 2019
A novel doubly robust estimator is introduced for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature; experiments show that this estimator uniformly outperforms existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
- Computer Science, ICML
- 2017
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; an upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
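A rough sketch of the switching idea described above: fall back to the reward model where importance weights are large and use the DR correction elsewhere. The threshold `tau`, the log format, and `q_hat` are assumptions of this sketch; the cited paper's SWITCH variants differ in details.

```python
import numpy as np

def switch_dr(logs, pi_e, q_hat, tau):
    """Apply the DR correction only when the importance weight is at most tau."""
    values = []
    for a, r, b_prob in logs:                    # (action, reward, behavior propensity)
        w = pi_e[a] / b_prob
        direct = float(np.dot(pi_e, q_hat))      # reward-model (direct) term
        if w <= tau:
            values.append(direct + w * (r - q_hat[a]))  # DR on moderate-weight samples
        else:
            values.append(direct)                # fall back to the reward model
    return float(np.mean(values))
```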
Adaptive Experimental Design for Efficient Treatment Effect Estimation: Randomized Allocation via Contextual Bandit Algorithm
- Mathematics, Computer Science, ArXiv
- 2020
This paper considers adaptive experimental design, in which research subjects sequentially visit a researcher who assigns them a treatment; a multi-armed bandit algorithm and martingale theory are used to construct an efficient estimator.
Efficient Policy Learning
- Computer Science, Economics, ArXiv
- 2017
This paper derives lower bounds on the minimax regret of policy learning under constraints and proposes a method that attains this bound asymptotically up to a constant factor whenever the class of policies under consideration has a bounded Vapnik-Chervonenkis dimension.
Semi-Parametric Efficient Policy Learning with Continuous Actions
- Mathematics, Economics, NeurIPS
- 2019
This work extends prior approaches to policy optimization from observational data, which considered only discrete actions, to continuous action spaces and to observational data where the data collection policy is unknown and must be estimated from the data.
More Robust Doubly Robust Off-policy Evaluation
- Computer Science, ICML
- 2018
This paper proposes alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameters by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal.
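A hedged sketch of the MRDR idea in this entry: rather than fitting the outcome model by regression, choose its parameters to minimize the empirical variance of the resulting DR scores. The tabular per-action parameterization, the log format, and the use of a generic optimizer are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def mrdr_fit(logs, pi_e, n_actions):
    """Pick the per-action outcome model that minimizes the variance of the DR scores."""
    def dr_scores(q):
        return np.array([float(np.dot(pi_e, q)) + pi_e[a] / p * (r - q[a])
                         for a, r, p in logs])        # logs: (action, reward, propensity)
    res = minimize(lambda q: float(np.var(dr_scores(q))),
                   x0=np.zeros(n_actions), method="Nelder-Mead")
    q_hat = res.x
    return q_hat, float(np.mean(dr_scores(q_hat)))    # fitted model and DR value estimate
```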
Confidence intervals for policy evaluation in adaptive experiments
- Mathematics, Proceedings of the National Academy of Sciences
- 2021
The approach is to adaptively reweight the terms of an augmented inverse propensity-weighting estimator to control the contribution of each term to the estimator’s variance, which reduces overall variance and yields an asymptotically normal test statistic.
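A sketch of the reweighting idea only, not the paper's exact adaptive weights: form augmented IPW (AIPW) scores per round, then average them with round-specific weights that shrink the contribution of rounds whose assignment probabilities were small. The square-root-propensity weights, the log format, and `q_hat` are illustrative choices for this sketch.

```python
import numpy as np

def adaptively_weighted_aipw(logs, pi_e, q_hat):
    """Weighted average of AIPW scores that down-weights low-propensity rounds."""
    scores, weights = [], []
    for a, r, b_prob in logs:                        # (action, reward, propensity at that round)
        score = float(np.dot(pi_e, q_hat)) + pi_e[a] / b_prob * (r - q_hat[a])
        scores.append(score)
        weights.append(np.sqrt(b_prob))              # one simple variance-damping choice
    weights = np.asarray(weights)
    return float(np.sum(weights * np.asarray(scores)) / np.sum(weights))
```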
Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
- Computer Science, NeurIPS
- 2019
This work proposes new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS.
Randomized Allocation with Nonparametric Estimation for a Multi-Armed Bandit Problem with Covariates
- Mathematics
- 2002
We study a multi-armed bandit problem in a setting where covariates are available. We take a nonparametric approach to estimate the functional relationship between the response (reward) and the…