Corpus ID: 219636361

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

@article{Kato2020ConfidenceIF,
  title={Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales},
  author={Masahiro Kato},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.06982}
}
This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via the bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the… 
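As a rough illustration of the setting the abstract describes, the sketch below computes an inverse probability weighting (IPW) estimate of an evaluation policy's value from logged bandit data and forms a normal confidence interval from per-round variance terms, in the spirit of a martingale central limit theorem. This is a hedged sketch, not the paper's standardized-martingale construction; the estimator form, variable names, and the 1.96 quantile are assumptions.

```python
# Hedged sketch only: IPW off-policy value estimate with a normal CI whose
# variance is accumulated from per-round terms. The paper's standardized-
# martingale approach is presumably more careful about how the adaptive
# behavior policy enters the variance estimate.
import numpy as np

def ipw_value_and_ci(rewards, behavior_probs, target_probs):
    """behavior_probs[t]: probability the (adaptively updated) behavior policy
    gave to the logged action at round t; target_probs[t]: probability the
    evaluation policy gives to that same action."""
    z = (target_probs / behavior_probs) * rewards      # per-round unbiased value terms
    value = z.mean()
    var_hat = np.sum((z - value) ** 2) / len(z) ** 2   # naive variance of the mean
    half_width = 1.96 * np.sqrt(var_hat)               # 95% normal quantile (assumption)
    return value, (value - half_width, value + half_width)

# Toy usage with synthetic logs from a two-armed bandit whose behavior policy drifts.
rng = np.random.default_rng(0)
T = 1000
behavior_p1 = rng.uniform(0.2, 0.8, size=T)                  # P(arm 1) at each round
actions = (rng.random(T) < behavior_p1).astype(int)
rewards = rng.binomial(1, np.where(actions == 1, 0.6, 0.4)).astype(float)
behavior_probs = np.where(actions == 1, behavior_p1, 1 - behavior_p1)
target_probs = np.where(actions == 1, 0.9, 0.1)              # evaluation policy: arm 1 w.p. 0.9
print(ipw_value_and_ci(rewards, behavior_probs, target_probs))
```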


Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples

This work theoretically and experimentally compares estimators for off-policy evaluation (OPE) from dependent samples obtained via multi-armed bandit (MAB) algorithms, focusing on a doubly robust (DR) estimator, which consists of an inverse probability weighting (IPW) component and an estimator of the conditional expected outcome.
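For concreteness, a minimal sketch of the standard DR combination described above, assuming a deterministic evaluation policy and a pre-fitted outcome model mu_hat (both hypothetical names):

```python
# Hedged sketch of a doubly robust (DR) off-policy value estimate: a
# direct-method term from an outcome model plus an IPW correction applied
# to the model's residual on the logged action.
def dr_value(contexts, actions, rewards, behavior_probs, target_policy, mu_hat):
    """target_policy(x) -> action the evaluation policy would take at context x.
    mu_hat(x, a)        -> estimated conditional expected reward."""
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, behavior_probs):
        a_eval = target_policy(x)
        direct = mu_hat(x, a_eval)                           # direct-method component
        correction = (a == a_eval) / p * (r - mu_hat(x, a))  # IPW residual correction
        total += direct + correction
    return total / len(rewards)
```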

A Practical Guide of Off-Policy Evaluation for Bandit Problems

This paper proposes a meta-algorithm built on existing OPE estimators for this setting and investigates the proposed concepts through experiments on synthetic and open real-world datasets.

References

Showing 1-10 of 42 references

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; an upper bound on its MSE is proved, and its benefits are demonstrated empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
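The SWITCH idea, as I recall it from the cited work (the threshold rule below is my assumption, not quoted from this summary), is to importance-weight only where the weight pi_e/pi_b is modest and to fall back on the reward model elsewhere:

```python
# Hedged sketch of a SWITCH-style estimator: importance weighting where the
# weight is at most tau, reward-model predictions for actions whose weight
# would exceed tau. The threshold tau and the exact rule are assumptions.
def switch_value(contexts, actions, rewards, pi_b, pi_e, reward_model,
                 tau=10.0, n_actions=2):
    """pi_b(x, a), pi_e(x, a): behavior / evaluation policy probabilities;
    reward_model(x, a): plug-in estimate of the expected reward."""
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        w = pi_e(x, a) / pi_b(x, a)
        if w <= tau:
            total += w * r                        # importance-weighted part
        for a2 in range(n_actions):               # model-based part
            if pi_e(x, a2) / pi_b(x, a2) > tau:
                total += pi_e(x, a2) * reward_model(x, a2)
    return total / len(rewards)
```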

Adaptive Experimental Design for Efficient Treatment Effect Estimation: Randomized Allocation via Contextual Bandit Algorithm

This paper considers adaptive experimental design, a setting in which research subjects sequentially visit a researcher who assigns each of them a treatment; a multi-armed bandit algorithm and martingale theory are used to construct an efficient estimator.

Efficient Policy Learning

This paper derives lower bounds for the minimax regret of policy learning under constraints and proposes a method that attains these bounds asymptotically up to a constant factor whenever the class of policies under consideration has bounded Vapnik-Chervonenkis dimension.

Semi-Parametric Efficient Policy Learning with Continuous Actions

This work extends prior approaches to policy optimization from observational data, which considered only discrete actions, to continuous action spaces and to settings where the data-collection policy is unknown and must be estimated from the data.

More Robust Doubly Robust Off-policy Evaluation

This paper proposes alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameter by minimizing the variance of the DR estimator, and proves that the MRDR estimators are strongly consistent and asymptotically optimal.
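In symbols, the criterion described above can be written as choosing the outcome-model parameter to minimize an empirical variance of the DR estimator (a paraphrase in assumed notation, with \(\hat{V}_{\mathrm{DR}}(\beta)\) the DR estimate under model parameter \(\beta\)):

```latex
\hat{\beta}_{\mathrm{MRDR}} \;=\; \arg\min_{\beta}\; \widehat{\operatorname{Var}}_n\!\left[\hat{V}_{\mathrm{DR}}(\beta)\right]
```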

Confidence intervals for policy evaluation in adaptive experiments

The approach is to adaptively reweight the terms of an augmented inverse propensity-weighting estimator to control the contribution of each term to the estimator’s variance, which reduces overall variance and yields an asymptotically normal test statistic.
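A rough sketch of the reweighting idea: each round's augmented IPW score receives a weight computed only from earlier rounds, and the weighted estimate is studentized into a test statistic. The specific weighting rule below (inverse of a running standard-deviation estimate) is an illustrative assumption, not the cited paper's rule.

```python
# Hedged sketch: adaptively weight per-round AIPW scores using only past data,
# then studentize the weighted mean so that a normal approximation can be used.
import numpy as np

def adaptively_weighted_estimate(scores, min_periods=10):
    scores = np.asarray(scores, dtype=float)
    T = len(scores)
    weights = np.ones(T)
    for t in range(min_periods, T):
        # Weight for round t uses rounds 0..t-1 only, so it is predictable
        # with respect to the data-collection process.
        weights[t] = 1.0 / (np.std(scores[:t]) + 1e-8)
    weights /= weights.sum()
    estimate = np.sum(weights * scores)
    std_err = np.sqrt(np.sum(weights ** 2 * (scores - estimate) ** 2))
    t_stat = estimate / (std_err + 1e-12)   # statistic for testing a zero value
    return estimate, std_err, t_stat
```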

Doubly Robust Policy Evaluation and Learning

It is proved that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies; the approach is expected to become common practice.

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

This work proposes new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS.

Randomized Allocation with Nonparametric Estimation for a Multi-Armed Bandit Problem with Covariates

We study a multi-armed bandit problem in a setting where covariates are available. We take a nonparametric approach to estimate the functional relationship between the response (reward) and the covariates.

Efficient Counterfactual Learning from Bandit Feedback

This work considers offline estimators for the expected reward from a counterfactual policy and shows them to have the lowest variance within a wide class of estimators, achieving variance reduction relative to standard estimators.