@article{Immorlica2019AdversarialBW,
author={Nicole Immorlica and Karthik Abinav Sankararaman and Robert E. Schapire and Aleksandrs Slivkins},
journal={2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)},
year={2019},
pages={202-219}
}
• Published 28 November 2018
• Computer Science
• 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)
We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on…

Non-stationary Bandits with Knapsacks
• Computer Science
ArXiv
• 2022
This paper shows that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and proposes a new notion of global non-stationarity measure.
Unifying the stochastic and the adversarial Bandits with Knapsack
• Computer Science
IJCAI
• 2019
This paper proposes EXP3.BwK, a novel algorithm that achieves order-optimal regret in the adversarial BwK setup, and incurs an almost optimal expected regret with an additional factor of $\log(B)$ in the stochastic BwK setup.
• Computer Science
NeurIPS
• 2020
It is proved that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1) and that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret will be at least Θ(T).
Bandits with Knapsacks beyond the Worst Case
• Computer Science
NeurIPS
• 2021
A general "reduction" is provided from BwK to bandits which takes advantage of some known helpful structure, and applies this reduction to combinatorial semi-bandits, linear contextual bandits, and multinomial-logit bandits.
• Computer Science
ArXiv
• 2020
This work largely resolves worst-case regret bounds for BwK for one limited resource other than time, and for known, deterministic resource consumption, and bounds regret within a given round ("simple regret").
Online Learning with Knapsacks: the Best of Both Worlds
• Computer Science
• 2022
This work provides the first best-of-both-worlds type framework for this setting, with no-regret guarantees both under stochastic and adversarial inputs, and allows the decision maker to handle non-convex reward and cost functions.
Multi-armed Bandits with Cost Subsidy
• Computer Science
AISTATS
• 2021
This paper considers a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards.
The Symmetry between Bandits and Knapsacks: A Primal-Dual LP-based Approach
• Computer Science
• 2021
This paper develops a primal-dual based algorithm that achieves a problem-dependent logarithmic regret bound for solving the general BwK problem and proposes a new notion of sub-optimality measure that highlights the important role of knapsacks in determining algorithm regret.
Blocking Bandits
• Computer Science
NeurIPS
• 2019
This paper introduces a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter, and shows that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically optimal.
Bandit Learning for Dynamic Colonel Blotto Game with a Budget Constraint
• Computer Science
2021 60th IEEE Conference on Decision and Control (CDC)
• 2021
An efficient dynamic policy is devised that uses a combinatorial bandit algorithm Edge on the path planning graph as a subroutine for another algorithm LagrangeBwK and it is shown that under the proposed policy, the learner's regret is bounded with high probability by a term sublinear in time horizon T and polynomial with respect to other parameters.

## References

Unifying the stochastic and the adversarial Bandits with Knapsack
• Computer Science
IJCAI
• 2019
This paper proposes EXP3.BwK, a novel algorithm that achieves order-optimal regret in the adversarial BwK setup, and incurs an almost optimal expected regret with an additional factor of $\log(B)$ in the stochastic BwK setup.
Bandits with Knapsacks
• Computer Science
2013 IEEE 54th Annual Symposium on Foundations of Computer Science
• 2013
This work presents two algorithms whose reward is close to the information-theoretic optimum: one is based on a novel "balanced exploration" paradigm, while the other is a primal-dual algorithm that uses multiplicative updates that is optimal up to polylogarithmic factors.
Linear Contextual Bandits with Knapsacks
• Computer Science
NIPS
• 2016
This work combines techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.
Bandits with concave rewards and convex knapsacks
• Computer Science
EC
• 2014
A very general model for exploration-exploitation tradeoff which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon is considered.
An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives
• Computer Science
COLT
• 2016
A computationally efficient algorithm is given for this contextual version of multi-armed bandit problem with global knapsack constraints with slightly better regret bounds, by generalizing the approach of Agarwal et al. (2014) for the non-constrained version of the problem.
Approximation Algorithms for Correlated Knapsacks and Non-martingale Bandits
• Computer Science
2011 IEEE 52nd Annual Symposium on Foundations of Computer Science
• 2011
New time-indexed LP relaxations are proposed, using a decomposition and "gap-filling" approach, to convert these fractional solutions to distributions over strategies, and then use the LP values and the time ordering information from these strategies to devise randomized adaptive scheduling algorithms.
• Computer Science
COLT
• 2018
The main idea of the algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer to come up with appropriate optimistic predictions and correction terms in this framework.
The adwords problem: online keyword matching with budgeted bidders under random permutations
• Economics, Education
EC '09
• 2009
The problem of a search engine trying to assign a sequence of search keywords to a set of competing bidders, each with a daily spending limit, is considered, and the current literature on this problem is extended by considering the setting where the keywords arrive in a random order.
Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits
• Computer Science
AAAI
• 2012
Two pulling policies are developed, namely: (i) KUBE; and (ii) fractional KUBE, and logarithmic upper bounds for the regret of both policies are proved, which are asymptotically optimal.
Online Learning with Vector Costs and Bandits with Knapsacks
• Computer Science
COLT
• 2020
A tight competitive ratio algorithm for adversarial Bandits with Knapsacks (BwK) is obtained, which improves over the $O(d \cdot \log T)$ competitive ratio algorithms of Immorlica et al.