An Arm-Wise Randomization Approach to Combinatorial Linear Semi-Bandits

@inproceedings{TakemuraIto2019,
  title={An Arm-Wise Randomization Approach to Combinatorial Linear Semi-Bandits},
  author={Kei Takemura and Shinji Ito},
  booktitle={2019 IEEE International Conference on Data Mining (ICDM)},
  year={2019}
}
  • Computer Science, Mathematics
Combinatorial linear semi-bandits (CLS) are widely applicable frameworks of sequential decision-making, in which a learner chooses a subset of arms from a given set of arms associated with feature vectors. Existing algorithms work poorly for the clustered case, in which the feature vectors form several large clusters. This shortcoming is critical in practice because it can be found in many applications, including recommender systems. In this paper, we clarify why such a shortcoming occurs, and…
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions
This paper shows that the C²UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound Õ(d√(kT) + dk) for partition matroid constraints, and proposes an algorithm that modifies the reward estimates of arms in the C²UCB algorithm, demonstrating that it enjoys the optimal regret bound.


Efficient Ordered Combinatorial Semi-Bandits for Whole-Page Recommendation
By adapting a minimum-cost maximum-flow network, a practical algorithm based on Thompson sampling is derived for the (contextual) combinatorial problem, thus resolving the problem of computational intractability.
Efficient Learning in Large-Scale Combinatorial Semi-Bandits
This paper considers efficient learning in large-scale combinatorial semi-bandits with linear generalization, and proposes two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and CombLinUCB, which are computationally efficient and provably statistically efficient under reasonable assumptions.
Thompson Sampling for Contextual Bandits with Linear Payoffs
A generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary, is designed and analyzed.
Stochastic Linear Optimization under Bandit Feedback
A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
Linear Thompson Sampling Revisited
Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach.
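The "randomized optimism" view above can be made concrete with a minimal linear Thompson Sampling sketch (not taken from the paper; the feature vectors, noise level, and scaling parameter `v` are illustrative assumptions). Each round, a parameter vector is sampled from a Gaussian posterior and the arm maximizing the sampled reward estimate is played:

```python
import numpy as np

def linear_thompson_sampling(features, theta_star, horizon, v=1.0, seed=0):
    """Linear Thompson Sampling sketch: maintain a Gaussian posterior over
    the unknown parameter theta, sample theta_tilde each round, and play
    the arm whose feature vector maximizes the sampled estimate."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    B = np.eye(d)      # ridge-regularized design matrix
    f = np.zeros(d)    # running sum of x_t * r_t
    total_reward = 0.0
    for _ in range(horizon):
        B_inv = np.linalg.inv(B)
        theta_hat = B_inv @ f
        # Sampling from an inflated posterior is the "randomized optimism";
        # its extra exploration is the source of the sqrt(d) regret factor.
        theta_tilde = rng.multivariate_normal(theta_hat, v ** 2 * B_inv)
        arm = int(np.argmax(features @ theta_tilde))
        x = features[arm]
        r = x @ theta_star + rng.normal(0.0, 0.1)  # noisy linear reward
        B += np.outer(x, x)
        f += r * x
        total_reward += r
    return total_reward
```

With two orthogonal arms whose true rewards are 1.0 and 0.2, the sampler concentrates on the better arm after a short exploration phase.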
Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation
Experiments conducted on a real-world movie recommendation dataset demonstrate that the principled approach, called contextual combinatorial bandit, can effectively address the above challenges and hence improve the performance of the recommendation task.
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
For the first time, it is shown that the Thompson Sampling algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem.
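For readers unfamiliar with the base algorithm analyzed above, a minimal Beta-Bernoulli Thompson Sampling sketch (with hypothetical arm means, not drawn from the paper) shows the core posterior-sampling loop:

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample a mean from each arm's
    Beta posterior, pull the argmax arm, and update that arm's posterior
    with the observed Bernoulli reward."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # prior successes + 1
    beta = [1] * k   # prior failures + 1
    pulls = [0] * k
    for _ in range(horizon):
        # One posterior sample per arm; play the best-looking arm.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Over a long horizon the posterior of the best arm concentrates, so suboptimal arms are pulled only logarithmically often, matching the regret result summarized above.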
Improved Algorithms for Linear Stochastic Bandits
A simple modification of Auer's UCB algorithm achieves constant regret with high probability and improves the regret bound by a logarithmic factor; experiments show a vast improvement.
Contextual Bandits with Linear Payoff Functions
An O(√(Td ln³(KT ln(T)/δ))) regret bound is proved that holds with probability 1 − δ for the simplest known upper-confidence-bound algorithm for this problem.
Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations
New efficient policies are shown to achieve regret that grows logarithmically with time, and polynomially in the number of unknown variables, for this combinatorial multi-armed bandit problem. Expand