To update or not to update? Delayed nonparametric bandits with randomized allocation

Sakshi Arya and Yuhong Yang
The delayed rewards problem in contextual bandits has been of interest in various practical settings. We study randomized allocation strategies and provide an understanding of how the exploration–exploitation trade-off is affected by delays in observing the rewards. In randomized strategies, the extent of exploration–exploitation is controlled by a user-determined exploration probability sequence. In the presence of delayed rewards, one may choose between using the original exploration sequence…
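As a rough illustration of the setting in the abstract, the following sketch simulates a randomized allocation rule in which a user-supplied exploration probability sequence decides, at each round, between a uniformly random arm and the arm with the best current estimate, while rewards only become visible after a delay. Everything here is an assumption made for illustration and is not taken from the paper: the function name `run_delayed_randomized_allocation`, the Bernoulli rewards, the absence of covariates, the sample-mean estimates (the paper uses nonparametric estimation), and the choice to index the exploration sequence by the round number.

```python
import random


def run_delayed_randomized_allocation(means, delays, horizon, explore_prob, seed=0):
    """Simulate a randomized allocation rule with delayed rewards.

    means        -- true Bernoulli mean of each arm (hypothetical simulation input)
    delays       -- delays[t - 1] is the delay, in rounds, of the reward earned at round t
    horizon      -- number of rounds to simulate
    explore_prob -- function mapping the round number n (1-indexed) to the
                    exploration probability used at that round
    Returns (pulls, counts): pulls per arm and observed rewards per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k    # sum of rewards observed so far, per arm
    counts = [0] * k    # number of rewards observed so far, per arm
    pulls = [0] * k     # number of times each arm was chosen
    pending = []        # (arrival_round, arm, reward) tuples not yet observed

    for t in range(1, horizon + 1):
        # Reveal every reward whose arrival round lies strictly before round t.
        still_pending = []
        for arrival, arm, reward in pending:
            if arrival < t:
                sums[arm] += reward
                counts[arm] += 1
            else:
                still_pending.append((arrival, arm, reward))
        pending = still_pending

        # Estimates use observed rewards only; unobserved arms default to 0.
        estimates = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(k)]

        if rng.random() < explore_prob(t):
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit

        pulls[arm] += 1
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # With delay d, the reward from round t is first usable at round t + d + 1.
        pending.append((t + delays[t - 1], arm, reward))

    return pulls, counts


if __name__ == "__main__":
    pulls, counts = run_delayed_randomized_allocation(
        means=[0.2, 0.8],
        delays=[5] * 200,
        horizon=200,
        explore_prob=lambda n: 1.0 / n ** 0.5,
    )
    print("pulls per arm:", pulls)
    print("observed rewards per arm:", counts)
```

In this sketch the estimates lag behind the pulls by the delay, so early exploitation decisions rest on fewer observations than the round number suggests; that gap is the kind of effect the abstract refers to when it discusses how delays interact with the exploration sequence.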



Kernel Estimation and Model Combination in A Bandit Problem with Covariates

This work considers a setting where the rewards of bandit machines are associated with covariates, and the accurate estimation of the corresponding mean reward functions plays an important role in the performance of allocation rules.

Bandits with Delayed, Aggregated Anonymous Feedback

An algorithm is provided that matches the worst-case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term when the delays are unbounded.

Bandits with Delayed Anonymous Feedback

It is demonstrated that logarithmic regret is still achievable, but with additional lower-order terms, and an algorithm is provided with regret $O(\log(T) + \sqrt{g(\tau)\log(T)} + g(\tau))$, where $g(\tau)$ is some function of the delay distribution.


We study a multi-armed bandit problem in a setting where covariates are available. We take a nonparametric approach to estimate the functional relationship between the response (reward) and the

Learning in Generalized Linear Contextual Bandits with Stochastic Delays

This paper designs a delay-adaptive algorithm, called Delayed UCB, for generalized linear contextual bandits using UCB-style exploration, and establishes regret bounds under various delay assumptions.

Stochastic Bandits with Delay-Dependent Payoffs

A nonstationary stochastic bandit model is proposed in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled, together with an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\mathcal{O}}\big(\!\sqrt{kT}\big)$.

The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits

The Stochastic Delayed Bandits (SDB) algorithm is presented, which takes black-box bandit algorithms (including heuristic approaches) as input while achieving good theoretical guarantees. Empirical results show that SDB outperforms state-of-the-art approaches to handling delay, heuristics, prior data, and evaluation.

Randomized Allocation with Nonparametric Estimation for Contextual Multi-Armed Bandits with Delayed Rewards

Stochastic Bandit Models for Delayed Conversions

This paper proposes and investigates a new stochastic multi-armed bandit model in the framework proposed by Chapelle (2014), based on empirical studies in the field of web advertising, in which each action may trigger a future reward that will then happen with a stochastic delay.

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

The focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs.