• Corpus ID: 3881356

# Thompson Sampling for Combinatorial Semi-Bandits

@article{Wang2018ThompsonSF,
title={Thompson Sampling for Combinatorial Semi-Bandits},
author={Siwei Wang and Wei Chen},
journal={ArXiv},
year={2018},
volume={abs/1803.04623}
}
• Siwei Wang, Wei Chen
• Published 13 March 2018
• Computer Science
• ArXiv
We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB, and obtain the first distribution-dependent regret bound of $O(mK_{\max}\log T / \Delta_{\min})$, where $m$ is the number of arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any…
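The TS algorithm analyzed in the abstract can be sketched concretely: maintain a Beta posterior per base arm, sample a mean vector, feed it to an offline oracle, and update every observed base arm (semi-bandit feedback). The sketch below is illustrative only and assumes Bernoulli base arms and a simple top-k oracle; the function name `cts_top_k` and the `pull` callback are hypothetical, not from the paper.

```python
import random

def cts_top_k(n_arms, k, pull, horizon, seed=0):
    """Combinatorial Thompson sampling sketch: Beta(1,1) priors on
    Bernoulli base arms, with a top-k selection as the offline oracle.
    `pull(i)` returns the observed 0/1 reward of base arm i."""
    rng = random.Random(seed)
    a = [1] * n_arms  # Beta alpha: 1 + observed successes per arm
    b = [1] * n_arms  # Beta beta:  1 + observed failures per arm
    total = 0
    for _ in range(horizon):
        # sample a mean for every base arm from its posterior
        theta = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        # offline oracle: here, simply the k arms with largest sampled means
        super_arm = sorted(range(n_arms), key=lambda i: -theta[i])[:k]
        # semi-bandit feedback: every base arm in the super arm is observed
        for i in super_arm:
            x = pull(i)
            a[i] += x
            b[i] += 1 - x
            total += x
    return total
```

With large gaps between arm means, the sampled means concentrate and the oracle settles on the optimal super arm, which is what drives the $O(mK_{\max}\log T/\Delta_{\min})$ distribution-dependent bound.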
## Citations

SHOWING 1-10 OF 87 CITATIONS

The first $\mathcal{O}(\log(T)/\Delta)$ approximation regret upper bound for CTS is provided, obtained under a specific condition on the approximation oracle, allowing a reduction to the exact oracle analysis.
• Computer Science
AISTATS
• 2019
This work analyzes the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting, and compares CTS with the combinatorial upper confidence bound (CUCB) algorithm via numerical experiments on a cascading bandit problem.
• Computer Science
ArXiv
• 2020
It is proved TSCSF-B can satisfy the fairness constraints, and the time-averaged regret is upper bounded by $\frac{N}{2\eta} + O\left(\frac{\sqrt{mNT\ln T}}{T}\right)$, which is the first problem-independent bound of TS algorithms for combinatorial sleeping multi-armed semi-bandit problems.
• Computer Science
ArXiv
• 2018
Empirical experiments demonstrate the superiority of TS-Cascade over existing UCB-based procedures in terms of expected cumulative regret and time complexity; the analysis also provides the first theoretical guarantee on a Thompson sampling algorithm for any stochastic combinatorial bandit problem model with partial feedback.
• Computer Science
COLT
• 2019
A new smoothness criterion, termed Gini-weighted smoothness, is introduced that takes into account both the nonlinearity of the reward and the concentration properties of the arms, and it is shown that the linear dependence of the regret on the batch size in existing algorithms can be replaced by this smoothness parameter.
• Computer Science
ArXiv
• 2021
This work considers the problem of maximizing the Conditional Value-at-Risk (CVaR) of the rewards obtained from the super arms of the combinatorial bandit for the two cases of Gaussian and bounded arm rewards and proposes new algorithms that maximize the CVaR.
• Computer Science
AAAI
• 2021
A new, more lenient, regret criterion is suggested that ignores suboptimality gaps smaller than some ε, and a variant of the Thompson Sampling algorithm, called ε-TS, is presented, and its asymptotic optimality is proved in terms of the lenient regret.
• Computer Science
ArXiv
• 2021
It is proved, under mild smoothness conditions, that the CS-UCB algorithm achieves an $O(\log(T))$ instance-dependent regret guarantee, and that when the range of the rewards is bounded, the regret guarantee of the CS-UCB algorithm is $O(\sqrt{T}\log(T))$ in a general setting.
• Computer Science
IEEE/ACM Transactions on Networking
• 2020
This paper considers a very general learning framework called combinatorial multi-armed bandit with probabilistically triggered arms and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS), and establishes Bayesian regret bounds for it.
• Computer Science, Mathematics
Proc. ACM Meas. Anal. Comput. Syst.
• 2021
AESCB is implementable in polynomial time $O(\delta_T^{-1}\,\mathrm{poly}(d))$ by repeatedly maximizing a linear function over $X$ subject to a linear budget constraint, and it is shown how to solve these maximization problems efficiently.
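Several of the cited works compare CTS against the CUCB baseline, which replaces posterior sampling with optimistic per-arm indices fed to the same offline oracle. A minimal sketch of the index computation, assuming Bernoulli base arms; the function name `cucb_indices` is hypothetical and the confidence-radius constant follows the form commonly used in the CMAB literature, so treat it as illustrative:

```python
import math

def cucb_indices(counts, emp_means, t):
    """Per-base-arm UCB indices in the style of CUCB: empirical mean plus
    a sqrt(3 ln t / (2 N_i)) confidence radius. Arms never played get +inf
    so the oracle is forced to include them first."""
    return [
        float("inf") if n == 0 else mu + math.sqrt(3 * math.log(t) / (2 * n))
        for n, mu in zip(counts, emp_means)
    ]
```

In a full CUCB loop these indices take the place of the posterior samples: the offline oracle maximizes over the index vector instead of a sampled mean vector.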

## References

SHOWING 1-10 OF 38 REFERENCES

• Wei Chen
• Computer Science, Mathematics
NIPS
• 2016
A stochastic combinatorial multi-armed bandit (CMAB) framework with a general nonlinear reward function is studied, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables.
• Computer Science
ICML
• 2015
It is proved that MP-TS for binary rewards has the optimal regret upper bound that matches the regret lower bound provided by Anantharam et al. (1987) and is the first computationally efficient algorithm with optimal regret.
• Computer Science
ICML
• 2013
The regret analysis is tight in that it matches the bound for classical MAB problem up to a constant factor, and it significantly improves the regret bound in a recent paper on combinatorial bandits with linear rewards.
• Computer Science
COLT
• 2012
For the first time, it is shown that Thompson Sampling algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem.
• Computer Science, Economics
• 2001
A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs is given.
• Computer Science
J. Mach. Learn. Res.
• 2016
The regret analysis is tight in that it matches the bound of UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits with linear rewards.
This work provides lower bound results showing that the factor $1/p^*$ is unavoidable for general CMAB-T problems, suggesting that the TPM condition is crucial in removing this factor.
• Computer Science
Machine Learning
• 2004
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
• Computer Science
ICML
• 2015
This paper considers efficient learning in large-scale combinatorial semi-bandits with linear generalization, and proposes two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and CombLinUCB, which are computationally efficient and provably statistically efficient under reasonable assumptions.