Fully Gap-Dependent Bounds for Multinomial Logit Bandit

@article{Yang2021FullyGB,
  title={Fully Gap-Dependent Bounds for Multinomial Logit Bandit},
  author={Jiaqi Yang},
  journal={ArXiv},
  year={2021},
  volume={abs/2011.09998}
}
  • Jiaqi Yang
  • Published 19 November 2020
  • Computer Science
  • ArXiv
  • Corpus ID: 227053946
We study the multinomial logit (MNL) bandit problem, where at each time step the seller offers an assortment of size at most $K$ from a pool of $N$ items, and the buyer purchases an item from the assortment according to an MNL choice model. The objective is to learn the model parameters and maximize the expected revenue. We present (i) an algorithm that identifies the optimal assortment $S^*$ within $\widetilde{O}(\sum_{i = 1}^N \Delta_i^{-2})$ time steps with high probability, and (ii) an…
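
As a concrete illustration of the interaction model, here is a minimal Python sketch of the MNL choice model described above; the preference weights v, the revenues r, and all function names are illustrative assumptions, not the paper's notation or code (the gaps $\Delta_i$ are defined in the paper and are not modeled here).

import numpy as np

def expected_revenue(S, v, r):
    # Expected revenue of assortment S under the MNL model:
    # R(S) = sum_{i in S} r_i * v_i / (1 + sum_{j in S} v_j),
    # where the no-purchase option has preference weight 1.
    denom = 1.0 + sum(v[i] for i in S)
    return sum(r[i] * v[i] for i in S) / denom

def sample_purchase(S, v, rng):
    # Sample the buyer's choice from assortment S; None means no purchase.
    items = list(S)
    weights = np.array([1.0] + [v[i] for i in items])  # slot 0 = no purchase
    k = rng.choice(len(weights), p=weights / weights.sum())
    return None if k == 0 else items[k - 1]

rng = np.random.default_rng(0)
v = np.array([0.8, 0.5, 0.3, 0.2])  # hypothetical preference weights, N = 4
r = np.array([1.0, 0.9, 0.7, 0.4])  # hypothetical per-item revenues
S = [0, 1]                          # an assortment of size K = 2
print(expected_revenue(S, v, r), sample_purchase(S, v, rng))

The seller observes only the sampled purchases, so the weights v must be estimated from that feedback alone.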
Instance-Sensitive Algorithms for Pure Exploration in Multinomial Logit Bandit
This paper gives efficient algorithms for pure exploration in MNL-bandit that achieve instance-sensitive pull complexities, and it complements the upper bounds with an almost matching lower bound.

References

Showing 1-10 of 36 references
Thompson Sampling for the MNL-Bandit
Presents an approach that adapts Thompson Sampling to this problem and shows that it achieves near-optimal regret as well as attractive numerical performance.
MNL-Bandit: A Dynamic Learning Approach to Assortment Selection
Gives an efficient algorithm that simultaneously explores and exploits, achieving performance independent of the underlying parameters; the algorithm is adaptive in the sense that its performance is near-optimal both in the "well separated" case and in the general parameter setting where this separation need not hold.
Combinatorial Bandits with Relative Feedback
Considers combinatorial online learning with subset choices when only relative feedback information from subsets is available, rather than absolute bandit or semi-bandit feedback.
A Near-Optimal Exploration-Exploitation Approach for Assortment Selection
Shows that, under a mild assumption, exploiting the specific characteristics of the MNL model makes it possible to design an algorithm with $\widetilde{O}(\sqrt{NT})$ regret, and demonstrates that this performance is nearly optimal.
Top-$k$ Combinatorial Bandits with Full-Bandit Feedback
Presents the Combinatorial Successive Accepts and Rejects (CSAR) algorithm, which generalizes SAR (Bubeck et al., 2013) to top-$k$ combinatorial bandits, along with an efficient sampling scheme that uses Hadamard matrices to accurately estimate the individual arms' expected rewards (see the first sketch after the reference list).
Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint
Develops an adaptive policy that learns the unknown parameters from past data while simultaneously optimizing profit, together with a simple algorithm, based on the geometry of lines in the plane, for computing a profit-maximizing assortment (see the second sketch after the reference list).
Adaptive Multiple-Arm Identification
Introduces a new hardness parameter for characterizing the difficulty of any given instance, and proves a lower bound showing that the extra $\log(\epsilon^{-1})$ factor is necessary for instance-dependent algorithms using the introduced hardness parameter.
Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models
Shows that a trisection-based algorithm achieves an item-independent regret bound of $O(\sqrt{T \log\log T})$, which matches information-theoretic lower bounds up to iterated logarithmic terms.
Combinatorial Multi-Armed Bandit with General Reward Functions
Studies the stochastic combinatorial multi-armed bandit (CMAB) framework with a general nonlinear reward function whose expected value may depend not only on the means of the input random variables but also on their entire distributions.
A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model
Designs a new active ranking algorithm that uses no information about the underlying items' preference scores, and establishes a matching lower bound on the sample complexity that holds even when the set of preference scores is given to the algorithm.
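
Two short sketches follow. The first illustrates the Hadamard-matrix estimation idea from the CSAR entry above: with full-bandit (sum-only) feedback, pulling the subsets encoded by the rows of a Hadamard matrix lets one recover every individual arm mean at once. The noiseless setting and all names are illustrative assumptions; the cited paper's actual scheme additionally handles noise.

import numpy as np
from scipy.linalg import hadamard

def estimate_means_from_sums(pull, d):
    # Recover d arm means from sum feedback. Row j of the Hadamard matrix H
    # encodes the subset A_j = {i : H[j, i] = +1}; the observations satisfy
    # y = (s * 1 + H mu) / 2 with s = sum(mu), so mu = H (2y - s * 1) / n.
    n = 1
    while n < d:
        n *= 2                        # Hadamard order must be a power of 2
    H = hadamard(n)                   # symmetric, with H @ H = n * I
    y = np.array([pull([i for i in range(d) if H[j, i] > 0]) for j in range(n)])
    s = y[0]                          # row 0 is all +1s, so y[0] = sum of means
    return ((H @ (2 * y - s)) / n)[:d]

# Toy usage with noiseless sum feedback over four hypothetical arms.
mu = np.array([0.9, 0.5, 0.3, 0.1])
pull = lambda S: sum(mu[i] for i in S)
print(estimate_means_from_sums(pull, d=4))  # recovers mu exactly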
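
The second sketch addresses the static subproblem behind the dynamic assortment entry above: computing a profit-maximizing assortment of size at most K when the MNL parameters are known. This is not the cited paper's line-geometry procedure but an equivalent, simpler binary search on the optimal revenue, using the fact that $R(S) \geq \lambda$ iff $\sum_{i \in S} v_i (r_i - \lambda) \geq \lambda$; all names are illustrative.

import numpy as np

def best_assortment(v, r, K, tol=1e-9):
    # Binary-search the optimal revenue lambda*. For a fixed lambda,
    # max_{|S| <= K} sum_{i in S} v_i * (r_i - lambda) is attained by the
    # top-K positive scores, and it is >= lambda iff lambda <= lambda*.
    lo, hi = 0.0, float(max(r))
    while hi - lo > tol:
        lam = (lo + hi) / 2
        top = np.sort(v * (r - lam))[::-1][:K]
        if top[top > 0].sum() >= lam:
            lo = lam                  # some assortment earns at least lam
        else:
            hi = lam
    scores = v * (r - lo)
    return [int(i) for i in np.argsort(scores)[::-1][:K] if scores[i] > 0]

v = np.array([0.8, 0.5, 0.3, 0.2])   # hypothetical preference weights
r = np.array([1.0, 0.9, 0.7, 0.4])   # hypothetical per-item revenues
print(best_assortment(v, r, K=2))    # -> [0, 1] for this instance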