Corpus ID: 231934124

Top-k eXtreme Contextual Bandits with Arm Hierarchy

Rajat Sen, Alexander Rakhlin, Lexing Ying, Rahul Kidambi, Dean P. Foster, Daniel N. Hill, Inderjit S. Dhillon
Motivated by modern applications, such as online advertisement and recommender systems, we study the top-k eXtreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select k arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-eXtreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of O(k…
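
For orientation, the Inverse Gap Weighting idea named in the abstract can be sketched in its standard single-arm form from the contextual bandit literature. This is a minimal illustration under stated assumptions, not the paper's top-k algorithm: the function names `igw_probabilities` and `sample_arm`, the exploration parameter `gamma`, and the example reward values are all hypothetical.

```python
import random


def igw_probabilities(predicted_rewards, gamma):
    """Single-arm Inverse Gap Weighting (illustrative sketch).

    Each non-greedy arm a gets probability 1 / (K + gamma * gap_a), where
    gap_a is the predicted-reward gap to the greedy arm; the greedy arm
    receives the remaining probability mass.
    """
    num_arms = len(predicted_rewards)
    best = max(range(num_arms), key=lambda a: predicted_rewards[a])
    probs = [0.0] * num_arms
    for a in range(num_arms):
        if a != best:
            gap = predicted_rewards[best] - predicted_rewards[a]
            probs[a] = 1.0 / (num_arms + gamma * gap)
    # Non-greedy arms sum to at most (K - 1) / K, so this is at least 1 / K.
    probs[best] = 1.0 - sum(probs)
    return probs


def sample_arm(probs, rng=random):
    """Draw one arm index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical predicted rewards for three arms.
probs = igw_probabilities([0.9, 0.5, 0.1], gamma=10.0)
arm = sample_arm(probs)
```

A larger `gamma` concentrates probability on the greedy arm, while `gamma = 0` gives uniform exploration.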

Model Selection for Generic Contextual Bandits
This work proposes a successive-refinement-based algorithm, Adaptive Contextual Bandit (ACB), that works in phases and successively eliminates model classes that are too simple to fit the given instance, and proves that the algorithm is adaptive.
Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability
This work designs a fast and simple algorithm that achieves the statistically optimal regret with only O(log T) calls to an offline least-squares regression oracle across all T rounds, providing the first universal and optimal reduction from contextual bandits to offline regression.


Top-$k$ Combinatorial Bandits with Full-Bandit Feedback
This work presents the Combinatorial Successive Accepts and Rejects (CSAR) algorithm, which generalizes SAR (Bubeck et al., 2013) for top-k combinatorial bandits, and an efficient sampling scheme that uses Hadamard matrices to accurately estimate the individual arms' expected rewards.
Thompson Sampling for Contextual Bandits with Linear Payoffs
This work designs and analyzes a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary.
Tighter Bounds for Multi-Armed Bandits with Expert Advice
A new algorithm, similar in spirit to EXP4, with a bound of $O(\sqrt{TS \log M})$, where the parameter S measures the extent to which expert recommendations agree; the key to this algorithm is a linear-programming-based exploration strategy that is optimal in a certain sense.
Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation
Experiments conducted on a real-world movie recommendation dataset demonstrate that the principled approach, called contextual combinatorial bandit, can effectively address the above challenges and hence improve the performance of the recommendation task.
Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
This work introduces a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of K actions in response to the observed context, and observes the reward only for that…
Stochastic Linear Optimization under Bandit Feedback
A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
A Contextual Bandit Bake-off
This work uses the availability of large numbers of supervised learning datasets to compare and empirically optimize contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning.
Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem
A new smoothness criterion, termed Gini-weighted smoothness, is introduced that takes into account both the nonlinearity of the reward and the concentration properties of the arms; it is shown that the linear dependence of the regret on the batch size in existing algorithms can be replaced by this smoothness parameter.
Contextual Gaussian Process Bandit Optimization
This work models the payoff function as a sample from a Gaussian process defined over the joint context-action space, develops CGP-UCB, an intuitive upper-confidence-style algorithm, and shows that context-sensitive optimization outperforms no or naive use of context.