Corpus ID: 219636361

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

Masahiro Kato
This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the assumption that the samples are i.i.d.
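To make the dependence concrete, here is a minimal, hypothetical sketch (the ε-greedy behavior policy, variable names, and parameter values are illustrative, not taken from the paper) of inverse probability weighting (IPW) on adaptively logged bandit data. Because each weight uses the behavior propensity that was actually in effect at that round, the weighted terms form a martingale difference sequence around the target value, which is what makes inference from dependent samples possible:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 5000, 2  # rounds, arms

# Simulate an adaptive (eps-greedy) behavior policy: each round's action
# distribution depends on the history, so the samples are dependent.
true_means = np.array([0.3, 0.6])
counts, sums = np.ones(K), np.zeros(K)
actions, rewards, probs = [], [], []
for t in range(T):
    eps = max(0.05, (t + 1) ** -0.5)      # decaying exploration rate
    greedy = int(np.argmax(sums / counts))
    p = np.full(K, eps / K)
    p[greedy] += 1.0 - eps
    a = int(rng.choice(K, p=p))
    r = float(rng.binomial(1, true_means[a]))
    counts[a] += 1
    sums[a] += r
    actions.append(a)
    rewards.append(r)
    probs.append(p[a])  # log the propensity actually used at round t

# Evaluation policy: always pull arm 1 (true value 0.6).
pi_e = np.array([0.0, 1.0])

# IPW with the logged per-round propensities.
a, r, b = np.array(actions), np.array(rewards), np.array(probs)
ipw_terms = pi_e[a] / b * r
estimate = ipw_terms.mean()
```

The key point is that `probs` records the behavior probability conditional on the history, not a fixed marginal distribution; plugging in a fixed distribution would break the martingale structure the paper's analysis relies on.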


Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples

This work theoretically and experimentally compares estimators for off-policy evaluation (OPE) using dependent samples obtained via multi-armed bandit (MAB) algorithms, focusing on a doubly robust (DR) estimator, which consists of an inverse probability weighting (IPW) component and an estimator of the conditionally expected outcome.
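The DR structure described above can be sketched in a few lines; this is a hypothetical illustration (function and variable names are not from the paper), with contexts omitted for brevity:

```python
import numpy as np

def dr_estimate(actions, rewards, behavior_probs, pi_e, mu_hat):
    """Doubly robust OPE sketch.
    actions: logged arms; rewards: logged outcomes;
    behavior_probs: propensity of the logged arm at each round;
    pi_e: evaluation-policy probabilities over arms, shape [K];
    mu_hat: estimated mean outcome per arm, shape [K]."""
    a = np.asarray(actions)
    r = np.asarray(rewards)
    b = np.asarray(behavior_probs)
    direct = pi_e @ mu_hat                          # outcome-model term
    correction = (pi_e[a] / b) * (r - mu_hat[a])    # IPW residual term
    return direct + correction.mean()
```

The estimator is "doubly robust" in the usual sense: if `mu_hat` is correct, the correction term has mean zero; if the propensities are correct, the correction removes the bias of `mu_hat`.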

A Practical Guide of Off-Policy Evaluation for Bandit Problems

This paper proposes a meta-algorithm based on existing OPE estimators for the situation, and investigates the proposed concepts using synthetic and open real-world datasets in experiments.



Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

The efficiency bound of OPE under a covariate shift is derived using an estimator of the density ratio between the distributions of the historical and evaluation data, and doubly robust, efficient estimators for OPE and OPL are proposed.

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

A novel doubly robust estimator is introduced for the OPE problem in RL, based on the targeted maximum likelihood estimation principle from the statistical causal inference literature. Empirically, this estimator uniformly outperforms existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification.

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR. An upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.

Adaptive Experimental Design for Efficient Treatment Effect Estimation: Randomized Allocation via Contextual Bandit Algorithm

This paper considers adaptive experimental design, a setting in which research subjects sequentially visit a researcher who assigns each a treatment. A multi-armed bandit algorithm and martingale theory are used to construct an efficient estimator.

Efficient Policy Learning

This paper derives lower bounds for the minimax regret of policy learning under constraints and proposes a method that attains this bound asymptotically up to a constant factor whenever the class of policies under consideration has a bounded Vapnik-Chervonenkis dimension.

Semi-Parametric Efficient Policy Learning with Continuous Actions

This work extends prior approaches to policy optimization from observational data, which considered only discrete actions, to the continuous-action setting, including the case where the data collection policy is unknown and must be estimated from the data.

More Robust Doubly Robust Off-policy Evaluation

This paper proposes alternative DR estimators, called more robust doubly robust (MRDR) estimators, that learn the model parameters by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal.

Confidence intervals for policy evaluation in adaptive experiments

The approach adaptively reweights the terms of an augmented inverse propensity weighting estimator to control each term's contribution to the estimator's variance, which reduces the overall variance and yields an asymptotically normal test statistic.
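The reweighting idea can be sketched as follows; this is a hypothetical simplification (the weight choice `sqrt(b)` and all names are illustrative, not the paper's actual weighting scheme): each per-round AIPW score is scaled by a weight derived from that round's behavior propensity and then normalized, so that no single high-variance term dominates.

```python
import numpy as np

def reweighted_aipw(actions, rewards, behavior_probs, pi_e, mu_hat):
    """Adaptively reweighted AIPW sketch (weights are illustrative)."""
    a = np.asarray(actions)
    r = np.asarray(rewards)
    b = np.asarray(behavior_probs)
    # Per-round AIPW scores: model term plus IPW-corrected residual.
    gamma = pi_e @ mu_hat + (pi_e[a] / b) * (r - mu_hat[a])
    # Illustrative variance-stabilizing weights, normalized to sum to 1.
    h = np.sqrt(b)
    w = h / h.sum()
    est = w @ gamma
    # Normal-approximation standard error from the weighted scores.
    se = np.sqrt(np.sum(w ** 2 * (gamma - est) ** 2))
    return est, se
```

Downweighting rounds with small propensities tempers the heavy-tailed terms that would otherwise inflate the variance and spoil the normal approximation.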

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

This work proposes new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS.


We study a multi-armed bandit problem in a setting where covariates are available. We take a nonparametric approach to estimate the functional relationship between the response (reward) and the covariates.