Advancements in Dueling Bandits

@inproceedings{Sui2018AdvancementsID,
  title={Advancements in Dueling Bandits},
  author={Yanan Sui and Masrour Zoghi and Katja Hofmann and Yisong Yue},
  booktitle={IJCAI},
  year={2018}
}
The dueling bandits problem is an online learning framework where learning happens "on-the-fly" through preference feedback, i.e., from comparisons between a pair of actions. Unlike conventional online learning settings that require absolute feedback for each action, the dueling bandits framework assumes only the presence of (noisy) binary feedback about the relative quality of each pair of actions. The dueling bandits problem is well-suited for modeling settings that elicit subjective or…
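To make this interaction protocol concrete, here is a minimal Python sketch. The 3-action preference matrix P is hypothetical; the point is that the learner proposes a pair of actions each round and observes only a noisy binary outcome of their comparison, never an absolute reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preference matrix: P[i, j] is the probability that
# action i wins a comparison against action j (P[i, j] + P[j, i] = 1).
P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.6],
              [0.3, 0.4, 0.5]])

def duel(i, j):
    """Noisy binary feedback: True iff action i beats action j."""
    return rng.random() < P[i, j]

# One round of the protocol: the learner proposes a pair of actions and
# observes only which one won, never an absolute reward for either.
if duel(0, 2):
    print("action 0 beat action 2")
else:
    print("action 2 beat action 0")
```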


Preference-based Online Learning with Dueling Bandits: A Survey
TLDR
The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits, and to provide an overview of problems that have been considered in the literature as well as methods for tackling them.
Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences
TLDR
This work proposes a novel reduction from any (general) dueling bandit problem to multi-armed bandits; despite its simplicity, the reduction improves many existing results in dueling bandits.
Batched Dueling Bandits
TLDR
This work studies the batched K-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and the stochastic triangle inequality, and obtains algorithms with a smooth trade-off between the number of batches and regret.
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
TLDR
A new algorithm is provided that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
KLUCB Approach to Copeland Bandits
TLDR
This work proposes a new method called Sup-KLUCB for the K-armed dueling bandit problem, specifically Copeland dueling bandits, by converting it into a standard MAB problem; it outperforms state-of-the-art Double Thompson Sampling (DTS) in the case of Copeland dueling bandits.
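Whatever the specifics of Sup-KLUCB, the KL-UCB index that such a reduction to a standard MAB builds on is easy to compute by bisection. Below is a minimal, generic sketch (not the paper's algorithm); here p_hat would be an empirical pairwise win rate.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(p_hat, n_pulls, t, tol=1e-6):
    """Largest q >= p_hat with n_pulls * KL(p_hat, q) <= log(t),
    found by bisection: the standard KL-UCB upper confidence bound."""
    budget = math.log(t) / n_pulls
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. the index for an empirical win rate of 0.55 after 40 comparisons
print(klucb_index(p_hat=0.55, n_pulls=40, t=1000))
```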
On testing transitivity in online preference learning
TLDR
This paper introduces an algorithmic framework for the dueling bandits problem in which the statistical validity of weak stochastic transitivity can be tested, either actively or passively, based on a multiple binomial hypothesis test, and derives lower bounds on the expected sample complexity of any sequential hypothesis testing algorithm for various forms of stochastic transitivity.
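The flavor of such a test can be illustrated on a single triple of arms. The sketch below is not the paper's procedure, just the basic binomial building block; all counts are hypothetical.

```python
# Suppose empirical win rates already suggest i beats j and j beats k.
# Weak stochastic transitivity then predicts P(i beats k) >= 1/2, which
# we test on the observed duels between i and k (hypothetical counts).
from scipy.stats import binomtest

wins_i_over_k = 18   # times i beat k
n_duels = 60         # total i-vs-k comparisons

# H0: P(i beats k) >= 1/2; a small p-value is evidence against transitivity.
result = binomtest(wins_i_over_k, n_duels, p=0.5, alternative="less")
print(f"p-value = {result.pvalue:.4f}")
# With many triples, the per-triple p-values would then be combined
# under a multiple-testing correction such as Bonferroni or Holm.
```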
Dueling Bandits with Adversarial Sleeping
TLDR
The problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA) is introduced and two algorithms are proposed, with near optimal regret guarantees, which are corroborated empirically.
Simple Algorithms for Dueling Bandits
TLDR
It is proved that the presented algorithms have regret bounds of order $O(T^\rho)$ with $1/2 \le \rho \le 3/4$ for time horizon $T$, which importantly do not depend on any preference gap $\Delta$ between actions.
Dueling Posterior Sampling for Preference-Based Reinforcement Learning
TLDR
A Bayesian approach to the credit assignment problem is developed, translating preferences into a posterior distribution over state-action reward models, and an asymptotic Bayesian no-regret rate is proved for Dueling Posterior Sampling (DPS) with a Bayesian linear regression credit assignment model.
MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation
TLDR
This paper proposes Merge Double Thompson Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that localizes the comparisons carried out by the algorithm to small batches of rankers, and then employs Thompson Sampling (TS) to reduce the comparisons between suboptimal rankers inside these small batches.
...

References

Showing 1-10 of 50 references
Multi-dueling Bandits with Dependent Arms
TLDR
This paper proposes the self-sparring algorithm, which reduces the multi-dueling bandits problem to a conventional bandit setting that can be solved using a stochastic bandit algorithm such as Thompson Sampling, and can naturally model dependencies using a Gaussian process prior.
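A minimal Beta-Bernoulli sketch of the self-sparring idea follows; the paper's algorithm instead uses a Gaussian process prior to model dependencies between arms. The utilities and the linear preference link below are illustrative.

```python
# One Thompson-sampling instance draws a score per arm; the top two
# draws duel, and winner/loser counts update the Beta posteriors.
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 2000
wins = np.ones(K)     # Beta prior pseudo-counts
losses = np.ones(K)

# Hypothetical utilities inducing preference probabilities (linear link).
utility = np.linspace(0.2, 0.8, K)
def duel(i, j):  # True iff arm i beats arm j
    return rng.random() < 0.5 + (utility[i] - utility[j]) / 2

for _ in range(T):
    theta = rng.beta(wins, losses)      # one posterior sample per arm
    i, j = np.argsort(theta)[-2:]       # duel the two highest draws
    if duel(i, j):
        wins[i] += 1; losses[j] += 1
    else:
        wins[j] += 1; losses[i] += 1

print("empirical best arm:", int(np.argmax(wins / (wins + losses))))
```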
Contextual Dueling Bandits
TLDR
This work proposes a new and natural solution concept, rooted in game theory, called a von Neumann winner, a randomized policy that beats or ties every other policy, and presents three efficient algorithms for online learning in this setting, and for approximating a vonNeumann winner from batch-like data.
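A von Neumann winner is exactly a maximin strategy of the zero-sum game with payoff matrix $P - 1/2$, so when the full preference matrix is known it can be computed with a small linear program. The sketch below assumes a known (hypothetical) matrix; the paper's algorithms instead learn online or from batch data.

```python
# Compute a von Neumann winner: a distribution w over arms with
# sum_i w_i * P[i, j] >= 1/2 for every arm j (beats or ties everyone).
import numpy as np
from scipy.optimize import linprog

P = np.array([[0.5, 0.6, 0.3],   # hypothetical 3-action preference matrix
              [0.4, 0.5, 0.7],   # (cyclic: no single Condorcet winner)
              [0.7, 0.3, 0.5]])
M = P - 0.5
K = len(P)

# Variables x = (w_1..w_K, v): maximize v s.t. (M^T w)_j >= v, sum w = 1.
c = np.zeros(K + 1); c[-1] = -1.0                  # minimize -v
A_ub = np.hstack([-M.T, np.ones((K, 1))])          # v - (M^T w)_j <= 0
b_ub = np.zeros(K)
A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, 1)] * K + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w = res.x[:K]
print("von Neumann winner:", np.round(w, 3))
```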
Beat the Mean Bandit
TLDR
This paper presents the first algorithm for this more general Dueling Bandits Problem and provides theoretical guarantees in both the online and the PAC settings, showing that the new algorithm has stronger guarantees than existing results even in the original Dueling Bandits Problem, which is validated empirically.
The K-armed Dueling Bandits Problem
Reducing Dueling Bandits to Cardinal Bandits
TLDR
Three reductions, named Doubler, MultiSBM and Sparring, provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting, with regret upper bounds proved in both finite and infinite settings.
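A minimal sketch of the Sparring reduction follows: two independent copies of a standard bandit algorithm (UCB1 here) each choose an arm, the chosen arms duel, and each copy treats "my arm won" as its 0/1 reward. The utilities and the linear preference link are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
K, T = 4, 3000
utility = np.array([0.1, 0.4, 0.6, 0.9])   # hypothetical arm utilities
def duel(i, j):  # True iff arm i beats arm j
    return rng.random() < 0.5 + (utility[i] - utility[j]) / 2

class UCB1:
    def __init__(self, k):
        self.n = np.zeros(k); self.s = np.zeros(k); self.t = 0
    def pick(self):
        self.t += 1
        if self.t <= len(self.n):            # play each arm once first
            return self.t - 1
        ucb = self.s / self.n + np.sqrt(2 * math.log(self.t) / self.n)
        return int(np.argmax(ucb))
    def update(self, arm, reward):
        self.n[arm] += 1; self.s[arm] += reward

left, right = UCB1(K), UCB1(K)
for _ in range(T):
    i, j = left.pick(), right.pick()
    left_won = duel(i, j)
    left.update(i, 1.0 if left_won else 0.0)
    right.update(j, 0.0 if left_won else 1.0)

print("left learner's favorite arm:", int(np.argmax(left.n)))
```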
Multi-Dueling Bandits and Their Application to Online Ranker Evaluation
TLDR
This work proposes a generalization of the dueling bandits model that uses simultaneous comparisons of an unrestricted number of rankers and shows that the algorithm yields orders of magnitude gains in performance compared to state-of-the-art dueling bandit algorithms.
Sparse Dueling Bandits
TLDR
It is proved that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm, which motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms.
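For intuition, the Borda score of arm i is its average probability of beating a uniformly random opponent, and the Borda winner maximizes it. The sketch below computes the scores from a known (hypothetical) preference matrix; SECS estimates them from samples instead.

```python
import numpy as np

P = np.array([[0.5, 0.6, 0.3],   # hypothetical preference matrix
              [0.4, 0.5, 0.7],
              [0.7, 0.3, 0.5]])
K = len(P)
borda = (P.sum(axis=1) - 0.5) / (K - 1)   # exclude the self-comparison
print("Borda scores:", np.round(borda, 3),
      "winner:", int(np.argmax(borda)))
```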
Instance-dependent Regret Bounds for Dueling Bandits
TLDR
This paper proposes a new algorithm whose regret, relative to a unique von Neumann winner with sparsity $s$, is at most $\tilde{O}(\sqrt{sT})$ plus an instance-dependent constant, when the sparsity is much smaller than the number of arms.
Regret Analysis for Continuous Dueling Bandit
TLDR
A stochastic mirror descent algorithm is proposed and it is shown that the algorithm achieves an $O(\sqrt{T\log T})$-regret bound under strong convexity and smoothness assumptions for the cost function.
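The core mechanism in such continuous dueling-bandit methods is estimating a descent direction from binary comparisons alone. Below is a simplified sketch of that idea (projected gradient descent rather than the paper's mirror descent; the cost function, logistic link, and all constants are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 5, 5000
delta, eta = 0.1, 0.5
x_star = np.full(d, 0.3)                 # hypothetical optimum

def cost(x):
    return np.sum((x - x_star) ** 2)     # strongly convex and smooth

def duel(x, y):
    """Noisy comparison: True iff y is preferred to x (logistic link)."""
    return rng.random() < 1.0 / (1.0 + np.exp(cost(y) - cost(x)))

x = np.zeros(d)
for t in range(T):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)               # random unit direction
    eta_t = eta / np.sqrt(t + 1)         # decaying step size
    # Move toward whichever of the two compared points won the duel.
    step = eta_t * u if duel(x, x + delta * u) else -eta_t * u
    x = np.clip(x + step, -1.0, 1.0)     # stay in the feasible box

print("final point:", np.round(x, 2))    # should drift toward x_star
```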
Double Thompson Sampling for Dueling Bandits
TLDR
Simulation results based on both synthetic and real-world data demonstrate the efficiency of the proposed Double Thompson Sampling algorithm for dueling bandit problems.
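A stripped-down sketch of the double-sampling idea in D-TS follows (the full algorithm additionally prunes candidates with RUCB-style confidence bounds). B[i, j] counts wins of i over j; the first arm is the Copeland winner of one sampled preference matrix, and the second arm is chosen by a fresh, independent sample of its duels against the first. The utilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 4, 3000
B = np.zeros((K, K))                         # B[i, j]: wins of i over j
utility = np.array([0.2, 0.5, 0.6, 0.8])     # hypothetical ground truth
def duel(i, j):  # True iff arm i beats arm j
    return rng.random() < 0.5 + (utility[i] - utility[j]) / 2

for _ in range(T):
    # First sample: pick the Copeland winner of a sampled matrix.
    theta = rng.beta(B + 1, B.T + 1)         # theta[i, j] ~ P(i beats j)
    np.fill_diagonal(theta, 0.5)
    copeland = (theta > 0.5).sum(axis=1)
    first = int(np.argmax(copeland))
    # Second, independent sample of each arm's duel against `first`.
    theta2 = rng.beta(B[:, first] + 1, B[first, :] + 1)
    theta2[first] = -np.inf                  # an arm cannot duel itself
    second = int(np.argmax(theta2))
    if duel(first, second):
        B[first, second] += 1
    else:
        B[second, first] += 1

print("most-winning arm:", int(np.argmax(B.sum(axis=1))))
```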
...