# Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

@article{Zoghi2014RelativeUC, title={Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem}, author={Masrour Zoghi and Shimon Whiteson and R{\'e}mi Munos and M. de Rijke}, journal={ArXiv}, year={2014}, volume={abs/1312.3393} }

This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a sharp finite-time regret bound of order O(K log T) on a very general class of…
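The selection rule described in the abstract can be illustrated with a short Python sketch. This is a simplified reading of that description, not the authors' reference implementation: the function name `rucb_sketch`, the preference matrix `pref`, the default `alpha`, the uniform random tie-breaking among candidate arms, and the treatment of never-compared pairs as maximally optimistic are all assumptions made for illustration.

```python
import math
import random


def rucb_sketch(pref, horizon, alpha=0.51, seed=0):
    """Simplified relative-UCB loop (illustrative assumptions throughout).

    pref[i][j] is the (hypothetical) probability that arm i beats arm j.
    Returns the pairwise win counts and how often each arm was the champion.
    """
    rng = random.Random(seed)
    k = len(pref)
    wins = [[0] * k for _ in range(k)]  # wins[i][j]: times i beat j
    champion_counts = [0] * k

    for t in range(1, horizon + 1):
        # Optimistic estimate u[i][j] of the probability that i beats j.
        u = [[0.5] * k for _ in range(k)]
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                n = wins[i][j] + wins[j][i]
                if n == 0:
                    # Assumption: unseen pairs are maximally optimistic.
                    u[i][j] = math.inf
                else:
                    bonus = math.sqrt(alpha * math.log(t) / n)
                    u[i][j] = wins[i][j] / n + bonus

        # Promising arms: optimistically not beaten by any other arm.
        candidates = [i for i in range(k)
                      if all(u[i][j] >= 0.5 for j in range(k))]
        c = rng.choice(candidates) if candidates else rng.randrange(k)

        # UCB relative to the benchmark c: strongest optimistic challenger.
        d = max(range(k), key=lambda j: u[j][c])

        # Simulated duel: only relative feedback is observed.
        if rng.random() < pref[c][d]:
            wins[c][d] += 1
        else:
            wins[d][c] += 1
        champion_counts[c] += 1

    return wins, champion_counts
```

Calling `rucb_sketch(pref, horizon)` with a matrix in which one arm beats every other arm with probability above 1/2 (a Condorcet winner) would be expected to select that arm as the champion increasingly often, although this sketch carries no formal guarantee.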

## 95 Citations

Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem

- Computer Science
- COLT
- 2015

An algorithm inspired by the Deterministic Minimum Empirical Divergence algorithm is proposed and its regret is analyzed; it is the first algorithm whose regret upper bound matches the lower bound.

Batched Dueling Bandits

- Computer Science
- ArXiv
- 2022

This work studies the batched K-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. It obtains algorithms with a smooth trade-off between the number of batches and regret.

Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm

- Computer Science
- ICML
- 2016

An efficient version of Copeland Winners Relative Minimum Empirical Divergence (ECW-RMED) is devised and its asymptotic regret bound is derived. Experimental comparisons of dueling bandit algorithms show that ECW-RMED significantly outperforms existing ones.

A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

- Computer Science
- ICML
- 2015

An efficient algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) is proposed to handle the adversarial utility-based formulation of the dueling bandit problem, which is a variation of the classical multi-armed bandit problem.

Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

- Computer Science
- ALT
- 2022

A new algorithm is provided that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.

Sparse Dueling Bandits

- Computer Science
- AISTATS
- 2015

It is proved that, in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm. This motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms.

Simple Algorithms for Dueling Bandits

- Computer Science
- ArXiv
- 2019

It is proved that the algorithms presented have regret bounds of order $O(T^{\rho})$ with $1/2 \le \rho \le 3/4$ for time horizon $T$, which importantly do not depend on any preference gap $\Delta$ between actions.

Regret Minimization in Stochastic Contextual Dueling Bandits

- Computer Science
- ArXiv
- 2020

This work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis.

Instance-dependent Regret Bounds for Dueling Bandits

- Computer Science
- COLT
- 2016

This paper proposes a new algorithm whose regret, relative to a unique von Neumann winner with sparsity $s$, is at most $\tilde{O}(\sqrt{sT})$, plus an instance-dependent constant, when the sparsity is much smaller.

Non-Stationary Dueling Bandits

- Computer Science
- ArXiv
- 2022

The Beat the Winner Reset algorithm is proposed and a bound on its expected binary weak regret in the stationary case is proved, which tightens the bound of current state-of-the-art algorithms.

## References

Improved Algorithms for Linear Stochastic Bandits

- Computer Science
- NIPS
- 2011

A simple modification of Auer's UCB algorithm achieves constant regret with high probability. In theory it improves the regret bound by a logarithmic factor, though experiments show a vast improvement.

Beat the Mean Bandit

- Computer Science
- ICML
- 2011

This paper presents the first algorithm for this more general Dueling Bandits Problem and provides theoretical guarantees in both the online and the PAC settings. It shows that the new algorithm has stronger guarantees than existing results even in the original Dueling Bandits Problem, which is validated empirically.

Exploration-exploitation tradeoff using variance estimates in multi-armed bandits

- Computer Science
- Theor. Comput. Sci.
- 2009

Relative confidence sampling for efficient on-line ranker evaluation

- Computer Science
- WSDM
- 2014

This paper proposes a new method called relative confidence sampling (RCS) that aims to reduce cumulative regret by being less conservative than existing methods in eliminating rankers from contention, and presents an empirical comparison between RCS and two state-of-the-art methods, relative upper confidence bound and SAVAGE.

Sample mean based index policies by O(log n) regret for the multi-armed bandit problem

- Computer Science, Mathematics
- Advances in Applied Probability
- 1995

This paper constructs index policies that depend on the rewards from each arm only through their sample mean, and that achieve an O(log n) regret with a constant that is based on the Kullback–Leibler number.

Analysis of Thompson Sampling for the Multi-armed Bandit Problem

- Computer Science
- COLT
- 2012

For the first time, it is shown that the Thompson Sampling algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem.

Kullback–Leibler upper confidence bounds for optimal sequential allocation

- Computer Science
- 2013

The main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively.

Pure Exploration in Multi-armed Bandits Problems

- Computer Science
- ALT
- 2009

The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting

- Computer Science
- IEEE Transactions on Information Theory
- 2012

This work analyzes an intuitive Gaussian process upper confidence bound algorithm and bounds its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design and obtaining explicit sublinear regret bounds for many commonly used covariance functions.