Corpus ID: 52932326

PAC Battling Bandits in the Plackett-Luce Model

@inproceedings{saha_pac_battling,
  title={PAC Battling Bandits in the Plackett-Luce Model},
  author={Aadirupa Saha and Aditya Gopalan},
  booktitle={International Conference on Algorithmic Learning Theory}
}
We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms, and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item…
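To make the feedback model concrete, here is a minimal Python sketch of winner and top-$m$ ranking feedback under a Plackett-Luce model, where the probability that item $i$ is the most preferred in subset $S$ is $\theta_i / \sum_{j \in S} \theta_j$. The scores and function names below are illustrative, not taken from the paper:

```python
import random

def pl_winner(scores, subset, rng=random):
    """Sample 'most preferred item' feedback for a chosen subset under a
    Plackett-Luce model: P(i wins | S) = theta_i / sum_{j in S} theta_j."""
    weights = [scores[i] for i in subset]
    return rng.choices(subset, weights=weights, k=1)[0]

def pl_top_m_ranking(scores, subset, m, rng=random):
    """Sample a ranking of the top-m items: draw winners sequentially
    without replacement, renormalising the scores each round."""
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        winner = pl_winner(scores, remaining, rng)
        ranking.append(winner)
        remaining.remove(winner)
    return ranking

# Made-up scores for n = 5 arms; the learner plays a k = 3 subset.
theta = {0: 1.0, 1: 0.5, 2: 2.0, 3: 0.8, 4: 1.2}
print(pl_top_m_ranking(theta, [0, 2, 4], m=2))
```

Sampling the top-$m$ ranking by repeatedly drawing winners from the shrinking subset is exactly the sequential definition of the PL distribution.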

From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model

This work considers PAC-learning a good item from $k$-subsetwise feedback information sampled from a Plackett-Luce probability model, with instance-dependent sample complexity performance, and gives an algorithm with optimal instance-dependent sample complexity for PAC best-arm identification.

Preselection Bandits under the Plackett-Luce Model

This paper introduces the Preselection Bandit problem, in which the learner preselects a subset of arms for a user, which then chooses the final arm from this subset, and proposes algorithms for which the upper bound on expected regret matches the lower bound up to a logarithmic term of the time horizon.

Combinatorial Bandits with Relative Feedback

We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute.

Best-item Learning in Random Utility Models with Subset Choices

Fundamental lower bounds on PAC sample complexity show that the learning algorithm given, based on pairwise relative counts of items and hierarchical elimination, is near-optimal in terms of its dependence on $n,k$ and $c$.

Preselection Bandits

This paper introduces the Preselection Bandit problem, in which the learner preselects a subset of arms for a user, which then chooses the final arm from this subset, and proposes algorithms for which the upper bound on expected regret matches the lower bound up to a logarithmic term of the time horizon.

Adversarial Dueling Bandits

The problem of regret minimization in adversarial dueling bandits is introduced, along with an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, together with a matching $\Omega(K^{1/3}T^{2/3})$ lower bound.
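For context, the Borda winner referenced above is the item maximising its average probability of beating a uniformly random opponent. A minimal sketch, using a hypothetical preference matrix:

```python
def borda_winner(P):
    """Given a K x K preference matrix P, with P[i][j] the probability that
    item i beats item j in a duel, return the Borda winner: the item with
    the highest average probability of beating a random opponent."""
    K = len(P)
    borda = [sum(P[i][j] for j in range(K) if j != i) / (K - 1)
             for i in range(K)]
    return max(range(K), key=lambda i: borda[i])

# Made-up preference matrix for K = 3 items (rows beat columns).
P = [[0.5, 0.6, 0.7],
     [0.4, 0.5, 0.8],
     [0.3, 0.2, 0.5]]
print(borda_winner(P))  # item 0: average win probability 0.65
```

Note that the Borda winner always exists, unlike a Condorcet winner, which is one reason it is a common target in the adversarial setting.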

Online Preselection with Context Information under the Plackett-Luce Model

The CPPL algorithm, inspired by the well-known UCB algorithm, is proposed for an extension of the contextual multi-armed bandit problem in which, instead of selecting a single alternative (arm), the learner makes a preselection in the form of a subset of alternatives.

Regret Minimization in Stochastic Contextual Dueling Bandits

This work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis.

Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

A new algorithm is provided that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.

Choice Bandits

An algorithm for choice bandits, termed Winner Beats All (WBA), is proposed with a distribution-dependent $O(\log T)$ regret bound under all these choice models; it is competitive with previous dueling bandit algorithms and outperforms the recently proposed MaxMinUCB algorithm designed for the MNL model.

Battle of Bandits

A novel class of pairwise-subset choice models is developed for modelling subset-wise winner feedback, and the optimality of Battling-Duel is established via a matching regret lower bound of $\Omega(n \log T)$, which shows that the flexibility of playing size-$k$ subsets does not help gather information faster than the corresponding dueling case, at least under this subset-wise feedback choice model.

PAC Subset Selection in Stochastic Multi-armed Bandits

The expected sample complexity bound for LUCB is novel even for single-arm selection, and a lower bound on the worst-case sample complexity of PAC algorithms for Explore-$m$ is given.

The K-armed Dueling Bandits Problem

Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach

The approach constructs, via a sorting procedure, a surrogate probability distribution over rankings whose pairwise marginals provably coincide with those of the Plackett-Luce distribution.
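The coincidence of pairwise marginals exploited here is the classical fact that under Plackett-Luce, the probability that item $i$ is ranked before item $j$ is $\theta_i/(\theta_i+\theta_j)$ (the Bradley-Terry form), regardless of the other items. A quick Monte Carlo check of this fact (illustrative code, not the paper's algorithm):

```python
import random

def sample_pl_ranking(scores, rng):
    """Sample a full ranking from a Plackett-Luce model by repeatedly
    drawing the next item with probability proportional to its score."""
    remaining = list(scores)
    ranking = []
    while remaining:
        item = rng.choices(remaining,
                           weights=[scores[i] for i in remaining])[0]
        ranking.append(item)
        remaining.remove(item)
    return ranking

# Empirically, P('a' ranked before 'b') should approach
# theta_a / (theta_a + theta_b) = 2.0 / 3.0, despite item 'c' being present.
theta = {"a": 2.0, "b": 1.0, "c": 0.5}
rng = random.Random(0)
trials = 50_000
wins = sum(r.index("a") < r.index("b")
           for r in (sample_pl_ranking(theta, rng) for _ in range(trials)))
print(wins / trials)  # close to 2/3
```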

Reducing Dueling Bandits to Cardinal Bandits

Three reductions, named Doubler, MultiSBM and Sparring, provide a generic schema for translating the extensive body of known results about conventional multi-armed bandit algorithms to the dueling bandits setting, with regret upper bounds proved in both finite and infinite settings.
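As a rough illustration of the reduction idea (not the paper's exact construction), the Sparring schema can be sketched as two black-box bandit learners dueling each other's choices; the minimal UCB learner below is a stand-in for any conventional MAB algorithm, and the Bradley-Terry duel oracle is made up for the demo:

```python
import math
import random

class UCBLearner:
    """Minimal UCB learner used as a black-box MAB algorithm."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0
    def select(self):
        self.t += 1
        for a, c in enumerate(self.counts):
            if c == 0:          # play each arm once first
                return a
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def sparring(duel, n_arms, horizon):
    """Sparring sketch: two black-box learners each pick an arm, the pair
    is dueled, and each learner is rewarded 1 iff its own arm won."""
    left, right = UCBLearner(n_arms), UCBLearner(n_arms)
    for _ in range(horizon):
        i, j = left.select(), right.select()
        left_wins = duel(i, j)
        left.update(i, 1.0 if left_wins else 0.0)
        right.update(j, 0.0 if left_wins else 1.0)
    return left, right

# Hypothetical duel oracle: P(i beats j) from Bradley-Terry scores.
rng = random.Random(0)
theta = [1.0, 3.0, 0.5]
duel = lambda i, j: rng.random() < theta[i] / (theta[i] + theta[j])
left, right = sparring(duel, 3, horizon=3000)
print(left.counts, right.counts)  # both learners concentrate on arm 1
```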

On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models

This work introduces generic notions of complexity for the two dominant frameworks considered in the literature, the fixed-budget and fixed-confidence settings, and provides the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when $m \ge 1$ under general assumptions.

lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits

It is proved that this UCB-style procedure, which identifies the arm with the largest mean in a multi-armed bandit game in the fixed-confidence setting using a small number of total samples, is optimal up to constants, and simulations show that it provides superior performance relative to the state of the art.

Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions

A family of UCB-style dueling bandit algorithms for general tournament solutions in social choice theory is proposed, with anytime regret bounds showing that they achieve low regret relative to the target winning set of interest.

A Near-Optimal Exploration-Exploitation Approach for Assortment Selection

It is shown that by exploiting the specific characteristics of the MNL model it is possible to design an algorithm with $\tilde{O}(\sqrt{NT})$ regret under a mild assumption, and it is demonstrated that this performance is nearly optimal.

Finite-time Analysis of the Multiarmed Bandit Problem

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
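Among the simple and efficient policies referred to here is UCB1, which pulls the arm maximising an optimism-adjusted empirical mean. A minimal sketch, assuming Bernoulli rewards with made-up means:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then always play the arm maximising
    empirical mean + sqrt(2 ln t / pulls); achieves logarithmic regret."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initial round-robin
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical Bernoulli bandit: arm 1 is best, with mean 0.7.
rng = random.Random(0)
means = [0.3, 0.7, 0.5]
counts = ucb1(lambda a: float(rng.random() < means[a]), 3, horizon=5000)
print(counts)  # the best arm receives the bulk of the pulls
```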