Corpus ID: 211677262

Budget-Constrained Bandits over General Cost and Reward Distributions

Semih Cayci, Atilla Eryilmaz, Rayadurgam Srikant
We consider a budget-constrained bandit problem where each arm pull incurs a random cost, and yields a random reward in return. The objective is to maximize the total expected reward under a budget constraint on the total cost. The model is general in the sense that it allows correlated and potentially heavy-tailed cost-reward pairs that can take on negative values as required by many applications. We show that if moments of order $(2+\gamma)$ for some $\gamma > 0$ exist for all cost-reward…
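To make the setting concrete, here is a minimal simulation sketch of a budget-constrained bandit loop. It is not the algorithm from the paper: it uses a simple UCB-style index on the empirical reward-to-cost ratio, and it assumes strictly positive, bounded costs, whereas the model above also allows heavy-tailed and negative values. The names `run_budgeted_ucb` and `arms`, and the convention of stopping before a pull that would overshoot the budget, are illustrative assumptions.

```python
import math
import random


def run_budgeted_ucb(arms, budget, seed=0):
    """Pull arms until the budget is spent, using a ratio-based UCB heuristic.

    `arms` is a list of callables; each takes a random.Random instance and
    returns a (cost, reward) pair. Costs are ASSUMED strictly positive so
    that the loop terminates and the ratio below is well defined.
    """
    rng = random.Random(seed)
    k = len(arms)
    pulls = [0] * k
    cost_sum = [0.0] * k
    reward_sum = [0.0] * k
    spent = 0.0
    total_reward = 0.0
    t = 0

    def index(i):
        # Empirical reward rate (mean reward / mean cost) plus an exploration bonus.
        mean_cost = max(cost_sum[i] / pulls[i], 1e-12)
        mean_reward = reward_sum[i] / pulls[i]
        return mean_reward / mean_cost + math.sqrt(2.0 * math.log(t) / pulls[i])

    while True:
        t += 1
        untried = [i for i in range(k) if pulls[i] == 0]
        arm = untried[0] if untried else max(range(k), key=index)
        cost, reward = arms[arm](rng)
        if spent + cost > budget:
            break  # this pull would overshoot the budget, so stop here
        spent += cost
        total_reward += reward
        pulls[arm] += 1
        cost_sum[arm] += cost
        reward_sum[arm] += reward

    return total_reward, pulls


if __name__ == "__main__":
    # Two hypothetical arms with unit cost and Bernoulli rewards.
    arms = [
        lambda rng: (1.0, 1.0 if rng.random() < 0.2 else 0.0),
        lambda rng: (1.0, 1.0 if rng.random() < 0.8 else 0.0),
    ]
    print(run_budgeted_ucb(arms, budget=500.0))
```

The index compares arms by reward per unit cost because, under a budget, the quantity that matters is how much reward each unit of spending buys rather than the reward of a single pull.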
A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback
A novel low-complexity algorithm based on Lyapunov optimization methodology, named LyOn, is proposed, and it is proved to achieve $O(\sqrt{B \log B})$ regret and $O(\log B / B)$ constraint violation.
An Efficient Pessimistic-Optimistic Algorithm for Constrained Linear Bandits
The algorithm is based on the primal-dual approach in optimization, and includes two components: the primal component is similar to unconstrained stochastic linear bandits, and the dual component depends on the number of constraints.
An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints
  • Xin Liu, Bin Li, Pengyi Shi, Lei Ying
  • Computer Science
  • 2021
The algorithm is based on the primal-dual approach in optimization and includes two components: the primal component is similar to unconstrained stochastic linear bandits and the dual component depends on the number of constraints but is independent of the sizes of the contextual space, the action space, and the feature space.
Continuous-Time Multi-Armed Bandits with Controlled Restarts
This work investigates the bandit problem with controlled restarts for time-constrained decision processes, and develops provably good, efficient online learning algorithms for finite and continuous action spaces of restart strategies.
POND: Pessimistic-Optimistic oNline Dispatch
A novel online dispatch algorithm, named POND (Pessimistic-Optimistic oNline Dispatch), is proposed, which achieves $O(\sqrt{T})$ regret and $O(1)$ constraint violation; experiments show that POND achieves low regret with minimal constraint violations.
An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints
  • Xin Liu
  • 2021
… is the number of constraints, $d$ is the dimension of the reward feature space, and $\delta$ is Slater's constant; and zero constraint violation in any round $\tau > \tau'$, where $\tau'$ is independent of the horizon $T$.
Group-Fair Online Allocation in Continuous Time
This work proposes a novel online learning algorithm based on dual ascent optimization for time averages, and proves that it achieves an $\tilde{O}(B^{-1/2})$ regret bound.


Multi-Armed Bandit with Budget Constraint and Variable Costs
It is shown that when applying the proposed algorithms to a previous setting with fixed costs, one can improve the previously obtained regret bound, and results on real-time bidding in ad exchange verify the effectiveness of the algorithms and are consistent with the theoretical analysis.
Bandits with Budgets: Regret Lower Bounds and Optimal Algorithms
Numerical experiments suggest that B-KL-UCB has the same or better finite-time performance when compared to various previously proposed (UCB-like) algorithms, which is important when applying such algorithms to a real-world problem.
Thompson Sampling for Budgeted Multi-Armed Bandits
This paper extends Thompson sampling to the budgeted MAB, where there is a random cost for pulling an arm and the total cost is constrained by a budget, and proves that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where $B$ denotes the budget.
Linear Contextual Bandits with Knapsacks
This work combines techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.
Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits
Two pulling policies are developed, namely (i) KUBE and (ii) fractional KUBE, the latter being computationally less expensive; logarithmic upper bounds are proved for the regret of both policies, and these bounds are shown to be asymptotically optimal.
Budgeted Bandit Problems with Continuous Random Costs
This work proposes an upper confidence bound based algorithm for multi-armed bandits and a confidence ball based algorithm for linear bandits, and proves logarithmic regret bounds for both algorithms.
Bandits with concave rewards and convex knapsacks
A very general model for the exploration-exploitation tradeoff is considered, which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon.
Multi-armed Bandits with Metric Switching Costs
A general duality-based framework is developed to provide the first $O(1)$ approximation for metric switching costs, with the actual constants being quite small.
Multi-armed bandit problems with heavy-tailed reward distributions
  • K. Liu, Qing Zhao
  • Computer Science, Mathematics
  • 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2011
An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, providing the first analysis of the expected regret for such algorithms.