Dueling Bandits with Qualitative Feedback

@article{Xu2018DuelingBW,
  title={Dueling Bandits with Qualitative Feedback},
  author={Liyuan Xu and Junya Honda and Masashi Sugiyama},
  journal={ArXiv},
  year={2018},
  volume={abs/1809.05274}
}
We formulate and study a novel multi-armed bandit problem called the qualitative dueling bandit (QDB) problem, in which an agent observes qualitative, rather than numeric, feedback when pulling each arm. We employ the same notion of regret as in the dueling bandit (DB) problem, where a duel is carried out by comparing the qualitative feedback of the two arms. Although classic DB algorithms can naively be applied to the QDB problem, this reduction significantly worsens the performance; in fact, in the QDB problem, the…
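To make the setting concrete, here is a minimal sketch of a single qualitative duel, assuming each arm returns an ordinal feedback level (e.g., a rating in {1, ..., 5}) drawn from its own categorical distribution, and the duel is decided by comparing the two observed levels. All names and distributions are illustrative, not from the paper.

```python
import random

# Illustrative sketch of a qualitative duel: each arm returns an ordinal
# feedback level (e.g., a rating in {1, ..., 5}) rather than a numeric reward.
# The duel is decided by comparing the two ordinal observations directly.

def pull(arm_probs):
    """Draw one ordinal feedback level from an arm's categorical distribution."""
    levels = list(range(1, len(arm_probs) + 1))
    return random.choices(levels, weights=arm_probs, k=1)[0]

def qualitative_duel(probs_a, probs_b):
    """Pull both arms once; return +1 if arm A wins, -1 if B wins, 0 on a tie."""
    fa, fb = pull(probs_a), pull(probs_b)
    return (fa > fb) - (fa < fb)

# Example (made-up distributions): arm A tends to give higher ratings than B.
arm_a = [0.05, 0.10, 0.15, 0.30, 0.40]  # P(level 1..5)
arm_b = [0.20, 0.30, 0.25, 0.15, 0.10]
wins = sum(qualitative_duel(arm_a, arm_b) == 1 for _ in range(10_000))
print(f"A won {wins} of 10000 duels")
```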

Combinatorial Pure Exploration for Dueling Bandits

This paper designs a fully polynomial-time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and uses the FPTAS as an oracle to design a novel pure-exploration algorithm, CAR-Cond, with a sample complexity analysis.
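For reference, the offline primitive mentioned above is easy to state: given a known preference matrix P, with P[i, j] the probability that arm i beats arm j, a Condorcet winner is an arm that beats every other arm with probability above 1/2. The sketch below is only this brute-force check, not the paper's FPTAS; the matrix is made up for illustration.

```python
import numpy as np

# Brute-force Condorcet winner check for a known pairwise winning-probability
# matrix P (P[i, j] = probability that arm i beats arm j).

def condorcet_winner(P):
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] > 0.5 for j in range(K) if j != i):
            return i
    return None  # no Condorcet winner exists

P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.8],
              [0.3, 0.2, 0.5]])
print(condorcet_winner(P))  # -> 0
```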

Ordinal Monte Carlo Tree Search

This paper examines Monte Carlo Tree Search (MCTS), a popular algorithm for solving MDPs, highlights a recurring problem concerning its use of rewards, and shows that an ordinal treatment of the rewards overcomes this problem.
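As a hedged illustration of what an ordinal treatment of rewards can look like (not the paper's exact algorithm), one can score a node by the estimated probability that its outcome beats a sibling's outcome, a quantity that uses only the order of rewards and never their magnitudes.

```python
# Hedged illustration: compare two nodes by the empirical probability that an
# outcome of A beats an outcome of B (ties count half). Only the order of the
# ordinal outcomes matters; their numeric encoding carries no magnitude.

def beat_prob(samples_a, samples_b):
    """P(outcome of A > outcome of B) + 0.5 * P(tie), from empirical samples."""
    wins = ties = 0
    for a in samples_a:
        for b in samples_b:
            wins += a > b
            ties += a == b
    n = len(samples_a) * len(samples_b)
    return (wins + 0.5 * ties) / n

# Two children observed only as ordinal outcomes {loss=0, draw=1, win=2}:
child1 = [2, 2, 1, 0, 2]
child2 = [1, 0, 1, 1, 0]
print(beat_prob(child1, child2))  # order-based value in [0, 1]
```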

References

Sparse Dueling Bandits

It is proved that in the absence of structural assumptions, the sample complexity of this problem is proportional to the sum of the inverse squared gaps between the Borda scores of each suboptimal arm and the best arm, which motivates a new algorithm called Successive Elimination with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner using fewer samples than standard algorithms.
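The Borda score referred to above has a standard definition: the Borda score of arm i is its average probability of beating a uniformly random opponent, b_i = (1/(K-1)) * sum over j != i of P[i, j], and the Borda winner maximizes it. A minimal sketch with a made-up preference matrix:

```python
import numpy as np

# Borda score of arm i: average probability of beating a uniformly random
# opponent, b_i = (1 / (K - 1)) * sum_{j != i} P[i, j].

def borda_scores(P):
    K = P.shape[0]
    off_diag_sums = P.sum(axis=1) - np.diag(P)  # drop the i-vs-i entry
    return off_diag_sums / (K - 1)

P = np.array([[0.5, 0.4, 0.9],
              [0.6, 0.5, 0.6],
              [0.1, 0.4, 0.5]])
b = borda_scores(P)
print(b, "Borda winner:", int(np.argmax(b)))
```

In this example the Borda winner (arm 0) differs from the Condorcet winner (arm 1, which beats both rivals with probability 0.6), which is one reason Borda-winner identification needs its own algorithms.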

Copeland Dueling Bandits

A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which is guaranteed to exist.
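For context, the Copeland score of an arm is the number of opponents it beats with probability above 1/2; a Copeland winner always exists, even when the preference matrix is cyclic and has no Condorcet winner. A minimal sketch with an illustrative cyclic matrix:

```python
import numpy as np

# Copeland score: number of opponents an arm beats with probability > 1/2.
# A Copeland winner always exists, unlike a Condorcet winner.

def copeland_winner(P):
    scores = (P > 0.5).sum(axis=1)  # diagonal is 0.5, so it never counts
    return int(np.argmax(scores)), scores

# A cyclic preference matrix (rock-paper-scissors among arms 0-2,
# plus a weak fourth arm), so no Condorcet winner exists:
P = np.array([[0.5, 0.6, 0.4, 0.7],
              [0.4, 0.5, 0.6, 0.7],
              [0.6, 0.4, 0.5, 0.7],
              [0.3, 0.3, 0.3, 0.5]])
winner, scores = copeland_winner(P)
print(scores, "Copeland winner:", winner)  # scores [2 2 2 0]; ties possible
```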

The K-armed Dueling Bandits Problem

Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem

An algorithm inspired by the Deterministic Minimum Empirical Divergence (DMED) algorithm is proposed and its regret is analyzed; it is the first algorithm with a regret upper bound that matches the lower bound.

Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions

A family of UCB-style dueling bandit algorithms is proposed for general tournament solutions in social choice theory, with anytime regret bounds showing that they achieve low regret relative to the target winning set of interest.

Qualitative Multi-Armed Bandits: A Quantile-Based Approach

This work formalizes and studies the multi-armed bandit problem in a generalized stochastic setting in which rewards are not assumed to be numerical, and addresses the problem of quantile-based online learning for both finite and infinite time horizons.
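The quantile-based view is natural here because ordinal levels have an order but no scale, so means are undefined while quantiles remain well-defined. A minimal sketch of an empirical τ-quantile over ordinal observations; the encoding and names are illustrative:

```python
# With ordinal rewards the mean is undefined (levels have order but no scale),
# while quantiles remain well-defined. Empirical tau-quantile of ordinal
# observations, used as an arm's quality measure; names are illustrative.

def empirical_quantile(samples, tau):
    """Smallest observed level whose empirical CDF reaches tau."""
    ordered = sorted(samples)
    idx = min(int(tau * len(ordered)), len(ordered) - 1)
    return ordered[idx]

ratings = ["bad", "ok", "ok", "good", "great", "good", "ok"]
level = {"bad": 0, "ok": 1, "good": 2, "great": 3}  # encodes the order only
print(empirical_quantile([level[r] for r in ratings], tau=0.5))  # median level
```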

Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing

A new PAC algorithm is proposed which, with probability at least 1 − δ, identifies a set of K arms with aggregate regret at most ε; a sample complexity bound for the algorithm is provided and a matching lower bound is established, demonstrating the near-optimality of the proposed algorithm.

Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

A sharp finite-time regret bound of order O(K log T) is proved for a very general class of dueling bandit problems, matching a lower bound proven in (Yue et al., 2012).
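For context, the optimistic pairwise index at the heart of RUCB adds an exploration bonus to each empirical pairwise win rate: the index for pair (i, j) is the empirical rate at which i beats j plus sqrt(α ln t / n_ij), with unexplored pairs treated as maximally optimistic. The sketch below follows that published form, but the wins matrix is made up.

```python
import numpy as np

# Optimistic pairwise index used by RUCB: empirical win rate of i over j plus
# an exploration bonus that shrinks with the number of duels between them.
# The paper requires alpha > 1/2.

def rucb_indices(wins, t, alpha=0.51):
    """wins[i, j] = number of duels between i and j won by i."""
    n = wins + wins.T  # total duels per pair
    with np.errstate(divide="ignore", invalid="ignore"):
        u = wins / n + np.sqrt(alpha * np.log(t) / n)
    u[n == 0] = 1.0            # unexplored pairs are maximally optimistic
    np.fill_diagonal(u, 0.5)   # an arm never duels itself
    return u

wins = np.array([[0, 7, 3],
                 [4, 0, 6],
                 [2, 1, 0]])
print(rucb_indices(wins, t=25))
```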

Double Thompson Sampling for Dueling Bandits

Simulation results based on both synthetic and real-world data demonstrate the efficiency of the proposed Double Thompson Sampling algorithm for dueling bandit problems.
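As a hedged sketch of the core step (the algorithm's confidence-bound pruning is omitted): Double Thompson Sampling keeps a Beta(w_ij + 1, w_ji + 1) posterior for each pairwise preference and draws a full preference matrix from it, sampling twice per round, once for each arm of the duel. The wins matrix below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Core step of Double Thompson Sampling: each pairwise preference p_ij gets a
# Beta(w_ij + 1, w_ji + 1) posterior; a full preference matrix is sampled from
# it (confidence-set pruning omitted in this sketch).

def sample_preference_matrix(wins):
    K = wins.shape[0]
    theta = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(i + 1, K):
            theta[i, j] = rng.beta(wins[i, j] + 1, wins[j, i] + 1)
            theta[j, i] = 1.0 - theta[i, j]
    return theta

wins = np.array([[0, 9, 4],
                 [3, 0, 8],
                 [2, 5, 0]])
theta = sample_preference_matrix(wins)
first_arm = int(np.argmax((theta > 0.5).sum(axis=1)))  # Copeland pick on the sample
print(theta.round(2), "first candidate:", first_arm)
```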

Generic Exploration and K-armed Voting Bandits

A generic pure-exploration algorithm is proposed that can cope with various utility functions, from multi-armed bandit settings to dueling bandits, offering a natural generalization of dueling bandits to situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.