Corpus ID: 216077420

Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

@article{Zoghi2014RelativeUC,
  title={Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem},
  author={Masrour Zoghi and Shimon Whiteson and R{\'e}mi Munos and M. de Rijke},
  journal={ArXiv},
  year={2014},
  volume={abs/1312.3393}
}
This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a sharp finite-time regret bound of order O(K log T) on a very general class of… 
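To make the two-stage selection concrete, below is a minimal Python sketch of an RUCB-style loop. This is our illustration, not the paper's exact procedure: the full algorithm additionally tracks a hypothesized best arm, and the duel oracle `compare`, the constant `alpha`, and the tie-breaking here are assumptions.

```python
import numpy as np

def rucb(compare, K, T, alpha=0.51, rng=None):
    """RUCB-style loop for the K-armed dueling bandit problem.

    compare(c, d) -> True if arm c wins a single duel against arm d.
    Returns wins, where wins[i, j] counts duels i won against j.
    """
    rng = rng or np.random.default_rng()
    wins = np.zeros((K, K))
    for t in range(1, T + 1):
        n = wins + wins.T                      # comparisons per pair so far
        with np.errstate(divide="ignore", invalid="ignore"):
            # optimistic (upper-confidence) estimate of P(i beats j)
            u = wins / n + np.sqrt(alpha * np.log(t) / n)
        u[np.isnan(u)] = 1.0                   # uncompared pairs: fully optimistic
        np.fill_diagonal(u, 0.5)
        # candidate: an arm that optimistically beats every other arm
        candidates = np.flatnonzero((u >= 0.5).all(axis=1))
        c = int(rng.choice(candidates)) if candidates.size else int(rng.integers(K))
        # challenger: the opponent with the best optimistic chance of beating c
        ucol = u[:, c].copy()
        ucol[c] = -np.inf                      # challenger must differ from candidate
        d = int(np.argmax(ucol))
        winner, loser = (c, d) if compare(c, d) else (d, c)
        wins[winner, loser] += 1
    return wins
```

Optimism does the work twice here: the candidate c must optimistically beat every rival, and the challenger d is the arm most likely to unseat it, so informative duels are played against the current benchmark.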

Citations

Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem
TLDR
An algorithm inspired by the Deterministic Minimum Empirical Divergence algorithm is proposed and its regret analyzed; it is the first algorithm with a regret upper bound that matches the lower bound.
Batched Dueling Bandits
TLDR
This work studies the batched K-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and the stochastic triangle inequality; in both, it obtains algorithms with a smooth trade-off between the number of batches and regret.
Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm
TLDR
An efficient version of Copeland Winners Relative Minimum Empirical Divergence (ECW-RMED) is devised and its asymptotic regret bound derived; experimental comparisons of dueling bandit algorithms show that ECW-RMED significantly outperforms existing ones.
A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits
TLDR
An efficient algorithm, the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3), is proposed to handle the adversarial utility-based formulation of this problem, a variation of the classical Multi-Armed Bandit problem.
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
TLDR
A new algorithm is provided that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
Sparse Dueling Bandits
TLDR
It is proved that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm. This motivates a new algorithm, Successive Elimination with Comparison Sparsity (SECS), that exploits sparsity to find the Borda winner using fewer samples than standard algorithms.
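Spelled out in symbols (our notation, assuming preference probabilities P(i ≻ j) and Borda winner i*), the score, gap, and stated sample complexity are:

```latex
B(i) = \frac{1}{K-1} \sum_{j \neq i} P(i \succ j), \qquad
\Delta_i = B(i^{*}) - B(i), \qquad
\text{samples needed} \;\propto\; \sum_{i \neq i^{*}} \Delta_i^{-2}.
```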
Simple Algorithms for Dueling Bandits
TLDR
It is proved that the algorithms presented have regret bounds for time horizon T of order O(T^ρ) with 1/2 ≤ ρ ≤ 3/4, which importantly do not depend on any preference gap Δ between actions.
Regret Minimization in Stochastic Contextual Dueling Bandits
TLDR
This work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis.
Instance-dependent Regret Bounds for Dueling Bandits
TLDR
This paper proposes a new algorithm whose regret, relative to a unique von Neumann winner with sparsity s, is at most Õ(√(sT)) plus an instance-dependent constant, when the sparsity s is much smaller than the number of arms.
Non-Stationary Dueling Bandits
TLDR
The Beat the Winner Reset algorithm is proposed and a bound on its expected binary weak regret in the stationary case is proved, which tightens the bound of current state-of-the-art algorithms.

References

Showing 1–10 of 42 references
The K-armed Dueling Bandits Problem
Improved Algorithms for Linear Stochastic Bandits
TLDR
A simple modification of Auer's UCB algorithm achieves constant regret with high probability; the regret bound is improved by a logarithmic factor, and experiments show a vast improvement.
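For orientation, a generic LinUCB-style index for linear stochastic bandits looks as follows. This is a textbook sketch, not the paper's exact construction; the fixed `beta` stands in for the confidence radius derived there, and `lam` is the ridge parameter.

```python
import numpy as np

class LinUCB:
    """Ridge-regression estimate plus an ellipsoidal confidence bonus."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # reward-weighted feature sum
        self.beta = beta

    def choose(self, arms):
        """Pick the arm (feature vector) with the largest optimistic reward."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b     # ridge estimate of the unknown parameter
        scores = [x @ theta + self.beta * np.sqrt(x @ A_inv @ x) for x in arms]
        return int(np.argmax(scores))

    def update(self, x, reward):
        """Rank-one update after observing `reward` for feature vector `x`."""
        self.A += np.outer(x, x)
        self.b += reward * x
```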
Beat the Mean Bandit
TLDR
This paper presents the first algorithm for this more general Dueling Bandits Problem, provides theoretical guarantees in both the online and the PAC settings, and shows that the new algorithm has stronger guarantees than existing results even in the original Dueling Bandits Problem, which is validated empirically.
Relative confidence sampling for efficient on-line ranker evaluation
TLDR
This paper proposes a new method called relative confidence sampling (RCS) that aims to reduce cumulative regret by being less conservative than existing methods in eliminating rankers from contention, and presents an empirical comparison between RCS and two state-of-the-art methods, relative upper confidence bound and SAVAGE.
Sample mean based index policies by O(log n) regret for the multi-armed bandit problem
  R. Agrawal, Advances in Applied Probability, 1995
TLDR
This paper constructs index policies that depend on the rewards from each arm only through their sample mean, and achieves a O(log n) regret with a constant that is based on the Kullback–Leibler number.
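As an illustration of such an index, here is the later and closely related UCB1 form of Auer et al. (2002), which likewise depends on each arm's history only through its sample mean; `c` is an illustrative exploration constant, not a value from the paper.

```python
import math

def sample_mean_index(mean, count, t, c=2.0):
    """Sample-mean index: empirical mean plus a padding term that
    shrinks as an arm is sampled more often."""
    if count == 0:
        return float("inf")   # force one pull of every arm first
    return mean + math.sqrt(c * math.log(t) / count)
```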
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
TLDR
For the first time, it is shown that the Thompson Sampling algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem.
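A minimal sketch of Bernoulli Thompson Sampling with Beta(1, 1) priors, in a standard textbook form rather than code from the paper; `pull` is a hypothetical reward oracle.

```python
import numpy as np

def thompson_bernoulli(pull, K, T, rng=None):
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors.

    pull(i) -> 0/1 reward of arm i. Returns per-arm (successes, failures).
    """
    rng = rng or np.random.default_rng()
    s = np.ones(K)   # Beta alpha parameters (successes + 1)
    f = np.ones(K)   # Beta beta parameters (failures + 1)
    for _ in range(T):
        i = int(np.argmax(rng.beta(s, f)))   # sample a mean per arm, play best
        r = pull(i)
        s[i] += r
        f[i] += 1 - r
    return s - 1, f - 1
```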
Kullback–Leibler upper confidence bounds for optimal sequential allocation
TLDR
The main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively.
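The KL-UCB index itself is easy to compute by bisection, since KL(p, q) is increasing in q for q ≥ p; the sketch below drops the paper's lower-order log log t term.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t):
    """Largest q >= mean with count * KL(mean, q) <= log t, by bisection."""
    if count == 0:
        return 1.0
    budget = math.log(max(t, 2)) / count
    lo, hi = mean, 1.0
    for _ in range(50):   # 50 halvings of [mean, 1] suffice for float precision
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```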
Pure Exploration in Multi-armed Bandits Problems
TLDR
The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting
TLDR
This work analyzes an intuitive Gaussian process upper confidence bound algorithm and bounds its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design and obtaining explicit sublinear regret bounds for many commonly used covariance functions.
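A toy 1-D sketch of the GP-UCB rule (query the argmax of posterior mean plus scaled posterior standard deviation); the RBF kernel, lengthscale, noise level, and the fixed `beta` replacing the paper's growing β_t schedule are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, ell=0.2):
    """Squared-exponential kernel matrix between 1-D point sets X and Y."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

def gp_ucb(f, grid, T, beta=4.0, noise=0.1, rng=None):
    """GP-UCB on a 1-D grid of candidate points (a numpy array)."""
    rng = rng or np.random.default_rng()
    X, y = [], []
    for _ in range(T):
        if not X:
            x = rng.choice(grid)               # no data yet: query at random
        else:
            Xa = np.array(X)
            K = rbf_kernel(Xa, Xa) + noise**2 * np.eye(len(X))
            Ks = rbf_kernel(np.asarray(grid), Xa)
            mu = Ks @ np.linalg.solve(K, np.array(y))    # posterior mean
            v = np.linalg.solve(K, Ks.T)
            var = 1.0 - np.sum(Ks * v.T, axis=1)         # posterior variance
            x = grid[int(np.argmax(mu + np.sqrt(beta * np.maximum(var, 0))))]
        X.append(x)
        y.append(f(x) + noise * rng.standard_normal())   # noisy observation
    return X, y
```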