Corpus ID: 236772651

Debiasing Samples from Online Learning Using Bootstrap

  • Ningyuan Chen, Xuefeng Gao, Yi Xiong
  • Published 2021
  • Computer Science, Mathematics
  • ArXiv
It has been recently shown in the literature [30, 35, 36] that the sample averages from online learning experiments are biased when used to estimate the mean reward. To correct the bias, off-policy evaluation methods, including importance sampling and doubly robust estimators, typically calculate the propensity score, which is unavailable in this setting due to the unknown reward distribution and the adaptive policy. This paper provides a procedure to debias the samples using bootstrap, which doesn't…
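The bias the abstract refers to can be reproduced in a few lines. The following is an illustrative simulation, not the paper's setup: two arms with identical true mean 0, a purely greedy policy, and hypothetical parameter choices. Because the greedy policy stops pulling an arm right after an unlucky streak (a form of optional stopping at low values), each arm's final sample mean is negatively biased.

```python
import numpy as np

def greedy_bandit_bias(n_runs=2000, horizon=50, seed=0):
    """Average bias of arm 0's sample mean under a greedy policy.

    Both arms have true mean 0 (rewards ~ N(0, 1)), so any nonzero
    average of arm 0's final sample mean is pure adaptive-sampling bias.
    """
    rng = np.random.default_rng(seed)
    final_means = []
    for _ in range(n_runs):
        sums, counts = np.zeros(2), np.zeros(2)
        for a in (0, 1):                      # one initial pull per arm
            sums[a] += rng.normal()
            counts[a] += 1
        for _ in range(horizon - 2):          # then always pull the leader
            a = int(np.argmax(sums / counts))
            sums[a] += rng.normal()
            counts[a] += 1
        final_means.append(sums[0] / counts[0])
    return float(np.mean(final_means))        # true mean is 0

bias = greedy_bandit_bias()
print(f"average bias of arm 0's sample mean: {bias:.3f}")  # negative
```

Averaged over many runs, the sample mean comes out clearly below the true mean of 0, which is exactly the phenomenon the bootstrap debiasing procedure targets.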

Figures and Tables from this paper


Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation
This work identifies conditions under which statistical bootstrapping in this setting is guaranteed to yield correct confidence intervals, and evaluates the proposed method, showing that it yields accurate intervals in a variety of conditions.
Bootstrapping Statistical Inference for Off-Policy Evaluation
This paper proposes a bootstrapping FQE method for inferring the distribution of the policy evaluation error and shows that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
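To make the doubly robust idea concrete in the one-step (bandit) case — a toy sketch, not the sequential estimator of the paper above — the estimator combines a reward model with an importance-weighted correction, and remains unbiased even when the model is wrong, provided the logging propensities are known. The function name, arm means, and uniform logging policy below are all illustrative assumptions.

```python
import numpy as np

def dr_value_estimate(rewards, actions, mu_probs, target_arm, r_hat):
    """Doubly robust estimate of a deterministic target policy's value.

    rewards, actions, mu_probs: logged bandit data with the behaviour
    policy's propensities; r_hat: a (possibly misspecified) per-arm
    reward model. DR = model prediction + importance-weighted residual.
    """
    rho = (actions == target_arm) / mu_probs   # pi(a)/mu(a), pi deterministic
    return float(np.mean(r_hat[target_arm] + rho * (rewards - r_hat[actions])))

rng = np.random.default_rng(0)
n, true_means = 20000, np.array([0.7, 0.3])
actions = rng.integers(0, 2, size=n)           # uniform logging policy
rewards = rng.binomial(1, true_means[actions]).astype(float)
mu_probs = np.full(n, 0.5)
r_hat = np.array([0.5, 0.5])                   # deliberately wrong reward model
est = dr_value_estimate(rewards, actions, mu_probs, 0, r_hat)
print(f"DR estimate of arm 0's value: {est:.3f}")  # close to 0.7
```

Even with a uniformly wrong reward model, the importance-weighted residual corrects the estimate toward the true value of 0.7; when the model is accurate, the residual term shrinks and so does the variance.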
Bootstrapped Thompson Sampling and Deep Exploration
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions.
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
This work develops a learning principle and an efficient algorithm for batch learning from logged bandit feedback and shows how CRM can be used to derive a new learning method — called Policy Optimizer for Exponential Models (POEM) — for learning stochastic linear rules for structured output prediction.
More Efficient Off-Policy Evaluation through Regularized Targeted Learning
A novel doubly-robust estimator is introduced for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature, which shows empirically that this estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification.
Bootstrapping Upper Confidence Bound
A non-parametric and data-dependent UCB algorithm based on the multiplier bootstrap is proposed, which derives both problem-dependent and problem-independent regret bounds for multi-armed bandits under a much weaker tail assumption than the standard sub-Gaussianity.
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
A bandit algorithm that explores by randomizing its history of rewards: it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history augmented with pseudo-rewards, and it easily generalizes to structured problems.
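A minimal sketch of that bootstrap-exploration step (the function name and the single pair of pseudo-rewards per arm are illustrative simplifications of the algorithm described above): each arm's 0/1 reward history is padded with pseudo-rewards of 0 and 1, resampled with replacement, and the arm with the highest bootstrap mean is pulled.

```python
import numpy as np

def bootstrap_bandit_step(histories, rng, n_pseudo=1):
    """Choose an arm by bootstrapping each arm's (0/1) reward history.

    histories[k]: list of observed rewards of arm k. Padding each
    history with n_pseudo pseudo-rewards of 0 and of 1 keeps
    under-sampled arms uncertain enough to still be explored.
    """
    means = []
    for h in histories:
        padded = np.array(list(h) + [0.0] * n_pseudo + [1.0] * n_pseudo)
        sample = rng.choice(padded, size=padded.size, replace=True)
        means.append(sample.mean())
    return int(np.argmax(means))

rng = np.random.default_rng(0)
picks = [bootstrap_bandit_step([[1] * 10, [0] * 10], rng) for _ in range(100)]
print(f"arm 0 chosen {picks.count(0)} / 100 times")
```

The resampling plays the role that posterior sampling plays in Thompson sampling: an arm with little data gets a noisy bootstrap mean and is occasionally pulled even when it currently looks worse.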
Toward Minimax Off-policy Value Estimation
It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
On the bias, risk and consistency of sample means in multi-armed bandits
A thorough and systematic treatment of the bias, risk and consistency of MAB sample means is delivered, and it is demonstrated that a new notion of "effective sample size" can be used to bound the risk of the sample mean under suitable loss functions.