Corpus ID: 19224657

Thompson Sampling for the MNL-Bandit

@article{Agrawal2017ThompsonSF,
  title={Thompson Sampling for the MNL-Bandit},
  author={Shipra Agrawal and Vashist Avadhanula and Vineet Goyal and Assaf J. Zeevi},
  journal={ArXiv},
  year={2017},
  volume={abs/1706.00977}
}
We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms) and observes (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision…
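For concreteness, under the MNL model the probability that item $i \in S$ is chosen from an offered subset $S$ is $v_i / (1 + \sum_{j \in S} v_j)$, with the remaining probability mass on the no-purchase outcome. Below is a minimal sketch of one round of this feedback; the parameter values and function names are illustrative, not taken from the paper.

```python
import numpy as np

def mnl_choice(subset, v, rng):
    """Sample one round of MNL feedback for an offered subset.

    subset : list of item indices offered (|subset| <= K)
    v      : array of MNL preference weights, v[i] > 0 (unknown to the learner)
    Returns the chosen item's index, or None for the no-purchase outcome.
    """
    weights = np.array([v[i] for i in subset], dtype=float)
    denom = 1.0 + weights.sum()                # "+1" is the no-purchase weight
    probs = np.append(weights / denom, 1.0 / denom)
    choice = rng.choice(len(subset) + 1, p=probs)
    return None if choice == len(subset) else subset[choice]

rng = np.random.default_rng(0)
v_true = np.array([0.5, 1.0, 0.2, 0.8])       # illustrative ground-truth weights
print(mnl_choice([0, 1, 3], v_true, rng))
```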

Citations

Multinomial Logit Bandit with Linear Utility Functions
TLDR
This paper considers the linear utility MNL choice model, in which item utilities are linear functions of $d$-dimensional item features, and proposes an algorithm, titled LUMB, that exploits this structure; the proven regret bound is free of the candidate set size.
Thompson Sampling for Multinomial Logit Contextual Bandits
TLDR
This work studies a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue (or, equivalently, minimizes the expected regret), and proposes two Thompson sampling algorithms for this multinomial logit contextual bandit.
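The revenue objective shared by these assortment papers has a closed form: under MNL weights $v$ and per-item revenues $r$, the expected revenue of an assortment $S$ is $\sum_{i \in S} r_i v_i / (1 + \sum_{j \in S} v_j)$. A brute-force sketch for small $N$ follows; all names and values here are illustrative.

```python
from itertools import combinations
import numpy as np

def expected_revenue(subset, r, v):
    """Expected one-step revenue of `subset` under the MNL choice model."""
    num = sum(r[i] * v[i] for i in subset)
    den = 1.0 + sum(v[i] for i in subset)
    return num / den

def best_assortment(r, v, K):
    """Enumerate all nonempty subsets of size <= K (feasible only for small N)."""
    n = len(r)
    candidates = (s for k in range(1, K + 1) for s in combinations(range(n), k))
    return max(candidates, key=lambda s: expected_revenue(s, r, v))

r = np.array([1.0, 0.8, 0.5, 0.9])   # illustrative per-item revenues
v = np.array([0.5, 1.0, 0.2, 0.8])   # illustrative MNL weights
print(best_assortment(r, v, K=2))
```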
Multinomial Logit Contextual Bandits: Provable Optimality and Practicality
TLDR
This work considers a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown, and proposes upper confidence bound based algorithms for this MNL contextual bandit.
MNL-Bandit: A Dynamic Learning Approach to Assortment Selection
TLDR
An efficient algorithm is given that simultaneously explores and exploits, achieving performance independent of the underlying parameters, and is adaptive in the sense that its performance is near-optimal both in the "well separated" case and in the general parameter setting where this separation need not hold.
Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit
TLDR
This work proposes an optimistic algorithm with a carefully designed exploration bonus term and shows that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret; moreover, the $\kappa$ factor affects only the poly-log term, not the leading term, of the regret bound.
Learning to Rank under Multinomial Logit Choice
TLDR
This work introduces a multinomial logit (MNL) choice model to the LTR framework, which captures the behaviour of users who consider the ordered list of items as a whole and make a single choice among all the items and a no-click option.
Multinomial Logit Contextual Bandits
TLDR
This work considers a dynamic assortment selection problem where the goal is to offer an assortment with cardinality constraint K from a set of N possible items, and proposes upper confidence interval based algorithms for this multinomial logit contextual bandit.
Choice Bandits
TLDR
This work proposes an algorithm for choice bandits, termed Winner Beats All (WBA), with a distribution-dependent $O(\log T)$ regret bound under a general class of choice models; WBA is competitive with previous dueling bandit algorithms and outperforms the recently proposed MaxMinUCB algorithm designed for the MNL model.
A Thompson Sampling Algorithm for Cascading Bandits
TLDR
Empirical experiments demonstrate the superiority of TS-Cascade over existing UCB-based procedures in terms of expected cumulative regret and time complexity; this work also provides the first theoretical guarantee for a Thompson sampling algorithm on any stochastic combinatorial bandit problem with partial feedback.
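For context, the cascade model assumes the user scans a ranked list top-down and clicks the first attractive item, if any. A minimal sketch of that feedback, with illustrative attraction probabilities (not from the paper):

```python
import numpy as np

def cascade_feedback(ranked_items, attract_prob, rng):
    """Return the position of the first clicked item, or None if no click.

    The user examines items in order and clicks item k with probability
    attract_prob[k], independently, stopping at the first click.
    """
    for pos, item in enumerate(ranked_items):
        if rng.random() < attract_prob[item]:
            return pos          # click observed at this position
    return None                 # user scanned the whole list, no click

rng = np.random.default_rng(1)
print(cascade_feedback([2, 0, 3], attract_prob=[0.1, 0.4, 0.3, 0.2], rng=rng))
```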
Fully Gap-Dependent Bounds for Multinomial Logit Bandit
TLDR
To the authors' knowledge, this work is the first to achieve gap-dependent bounds that depend fully on the suboptimality gaps of all items; an algorithm attaining such a regret bound over $T$ time steps is presented.

References

Showing 1-10 of 37 references
Thompson Sampling for Multinomial Logit Contextual Bandits
TLDR
This work studies a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue (or, equivalently, minimizes the expected regret), and proposes two Thompson sampling algorithms for this multinomial logit contextual bandit.
A Near-Optimal Exploration-Exploitation Approach for Assortment Selection
TLDR
It is shown that, by exploiting the specific structure of the MNL model, one can design an algorithm with $\tilde{O}(\sqrt{NT})$ regret under a mild assumption, and that this performance is nearly optimal.
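A key device in this line of work is epoch-based exploration: the same assortment is offered repeatedly until a no-purchase event occurs, and the number of times item $i$ is picked within the epoch is an unbiased estimate of its MNL weight $v_i$. A sketch under that scheme (parameter values illustrative, not from the paper):

```python
from collections import Counter
import numpy as np

def run_epoch(subset, v, rng):
    """Offer `subset` repeatedly until a no-purchase event; return pick counts.

    Under the MNL model, the expected count for item i over one epoch equals
    v[i], which is what makes the epoch statistic an unbiased estimate.
    """
    counts = Counter()
    while True:
        w = np.array([v[i] for i in subset], dtype=float)
        probs = np.append(w, 1.0) / (1.0 + w.sum())   # last slot = no purchase
        k = rng.choice(len(subset) + 1, p=probs)
        if k == len(subset):
            return counts                             # epoch ends
        counts[subset[k]] += 1

rng = np.random.default_rng(0)
v_true = [0.5, 1.0, 0.2, 0.8]                         # illustrative weights
est, n_epochs = Counter(), 2000
for _ in range(n_epochs):
    est.update(run_epoch([0, 1, 3], v_true, rng))
print({i: est[i] / n_epochs for i in [0, 1, 3]})      # ≈ 0.5, 1.0, 0.8
```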
Multinomial Logit Contextual Bandits
TLDR
This work considers a dynamic assortment selection problem where the goal is to offer an assortment with cardinality constraint K from a set of N possible items, and proposes upper confidence interval based algorithms for this multinomial logit contextual bandit.
Optimistic Bayesian Sampling in Contextual-Bandit Problems
TLDR
This work considers the approach of Thompson (1933), which uses samples from the posterior distribution of each action's instantaneous value, and proposes a new algorithm, Optimistic Bayesian Sampling (OBS), that performs competitively against recently proposed benchmark algorithms and outperforms Thompson's method throughout.
Thompson Sampling for Contextual Bandits with Linear Payoffs
TLDR
This work designs and analyzes a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary.
Thompson Sampling for Complex Online Problems
TLDR
A frequentist regret bound is proved for Thompson sampling in a very general setting involving parameter, action, and observation spaces and a likelihood function over them; improved regret bounds are derived for classes of complex bandit problems involving subset selection, including the first nontrivial regret bounds for nonlinear reward feedback from subsets.
Linearly Parameterized Bandits
TLDR
It is proved that the regret and Bayes risk are of order $\Theta(r\sqrt{T})$, by establishing a lower bound for an arbitrary policy and exhibiting a matching upper bound via a policy that alternates between exploration and exploitation phases.
Robust Dynamic Assortment Optimization in the Presence of Outlier Customers
TLDR
A new robust online assortment optimization policy is developed via an active elimination strategy; it outperforms existing policies based on upper confidence bounds (UCB) and Thompson sampling, and is optimal up to a logarithmic factor in $T$ when the assortment capacity is constant.
Finite-time Analysis of the Multiarmed Bandit Problem
TLDR
This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
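This reference's UCB1 policy plays each arm once and then pulls the arm maximizing $\bar{x}_j + \sqrt{2\ln t / n_j}$. A compact sketch, with an illustrative Bernoulli reward model:

```python
import math
import numpy as np

def ucb1(pull_arm, n_arms, horizon):
    """UCB1 of Auer, Cesa-Bianchi, and Fischer: play each arm once,
    then play the arm maximizing mean + sqrt(2 ln t / n)."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                             # initial round-robin
        else:
            bonus = np.sqrt(2.0 * math.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        reward = pull_arm(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

rng = np.random.default_rng(2)
true_means = [0.3, 0.5, 0.7]                        # illustrative Bernoulli arms
means, counts = ucb1(lambda a: float(rng.random() < true_means[a]), 3, 5000)
print(counts)                                       # best arm dominates the pulls
```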
Dynamic Assortment Selection under the Nested Logit Models
TLDR
This work studies a stylized dynamic assortment planning problem over a selling season of finite length $T$, considering a nested multinomial logit model with $M$ nests and $N$ items per nest, and introduces a discretization technique that leads to regret of order $\tilde{O}(\sqrt{M}T^{2/3}+MNT^{1/3})$ under a specific choice of discretization granularity.