Corpus ID: 211677262

Budget-Constrained Bandits over General Cost and Reward Distributions

@inproceedings{Cayci2020BudgetConstrainedBO,
  title={Budget-Constrained Bandits over General Cost and Reward Distributions},
  author={Semih Cayci and Atilla Eryilmaz and Rayadurgam Srikant},
  booktitle={AISTATS},
  year={2020}
}
We consider a budget-constrained bandit problem where each arm pull incurs a random cost, and yields a random reward in return. The objective is to maximize the total expected reward under a budget constraint on the total cost. The model is general in the sense that it allows correlated and potentially heavy-tailed cost-reward pairs that can take on negative values as required by many applications. We show that if moments of order $(2+\gamma)$ for some $\gamma > 0$ exist for all cost-reward… 
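To make the setup concrete, here is a minimal, hypothetical simulation of the problem described above: every pull of an arm draws a correlated (cost, reward) pair, and the learner keeps pulling until a budget B is exhausted. The bivariate normal cost/reward model and the UCB-over-rates index below are assumptions made only for this sketch; they are not the algorithm or the distributional model analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative environment: each pull of an arm draws a correlated (cost, reward)
# pair.  The bivariate normal model below is an assumption of this sketch only;
# the paper allows general, possibly heavy-tailed distributions with finite
# moments of order (2 + gamma).
ARM_MEANS = [(1.0, 1.0), (0.8, 0.9), (1.2, 1.5)]   # (mean cost, mean reward)
COV = [[0.04, 0.02], [0.02, 0.09]]                 # cost-reward correlation
K = len(ARM_MEANS)

def pull(arm):
    cost, reward = rng.multivariate_normal(ARM_MEANS[arm], COV)
    return max(cost, 1e-3), reward                 # keep costs positive in the toy model

def run(budget):
    """Generic UCB-over-rates policy: play each arm once, then repeatedly pull
    the arm with the largest optimistic estimate of reward per unit cost,
    stopping once the budget is spent."""
    n = np.zeros(K)
    cost_sum = np.zeros(K)
    reward_sum = np.zeros(K)
    spent, total_reward, t = 0.0, 0.0, 0

    while spent < budget:
        t += 1
        if t <= K:
            arm = t - 1                              # initial round-robin
        else:
            rate = reward_sum / cost_sum             # empirical reward/cost rate
            bonus = np.sqrt(2.0 * np.log(t) / n)     # optimism; ad hoc constant
            arm = int(np.argmax(rate + bonus))
        cost, reward = pull(arm)
        n[arm] += 1
        cost_sum[arm] += cost
        reward_sum[arm] += reward
        spent += cost
        total_reward += reward
    return total_reward

print("total reward with budget 500:", run(500.0))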
An Efficient Pessimistic-Optimistic Algorithm for Constrained Linear Bandits
TLDR
The algorithm is based on the primal-dual approach in optimization, and includes two components: the primal component is similar to unconstrained stochastic linear bandits, and the dual component depends on the number of constraints.
An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints
TLDR
The algorithm is based on the primal-dual approach in optimization and includes two components: the primal component is similar to unconstrained stochastic linear bandits and the dual component depends on the number of constraints but is independent of the sizes of the contextual space, the action space, and the feature space.
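As a rough illustration of the primal-dual structure described in this summary (and not the paper's actual algorithm), the sketch below uses a simplified multi-armed rather than linear setting: the primal step picks an arm using confidence-based estimates weighted by the dual variables, and the dual step grows one virtual queue per constraint by the observed budget violation. The step size eta, the Bernoulli feedback model, and all constants are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(1)

# Toy instance: K arms, m per-round resource constraints.  All means are
# unknown to the learner; Bernoulli feedback is an assumption of this sketch.
K, m, T = 5, 2, 5000
reward_mean = rng.uniform(0.2, 0.9, size=K)
cost_mean = rng.uniform(0.1, 0.8, size=(m, K))
cost_budget = np.full(m, 0.4)      # allowed average per-round resource usage

n = np.zeros(K)                    # pull counts
r_hat = np.zeros(K)                # empirical mean rewards
c_hat = np.zeros((m, K))           # empirical mean resource usage
Q = np.zeros(m)                    # dual variables (virtual queues), one per constraint
eta = 1.0 / np.sqrt(T)             # dual step size (assumed for this sketch)
total_reward = 0.0

for t in range(1, T + 1):
    # Primal step: upper-confidence reward and lower-confidence cost estimates,
    # combined through the current dual variables.
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(n, 1.0))
    ucb_r = np.where(n > 0, r_hat + bonus, np.inf)   # unplayed arms get priority
    lcb_c = np.where(n > 0, c_hat - bonus, 0.0)
    arm = int(np.argmax(ucb_r - Q @ lcb_c))

    # Observe noisy reward and resource usage for the chosen arm.
    r = rng.binomial(1, reward_mean[arm])
    c = rng.binomial(1, cost_mean[:, arm]).astype(float)
    total_reward += r

    # Update empirical means.
    n[arm] += 1
    r_hat[arm] += (r - r_hat[arm]) / n[arm]
    c_hat[:, arm] += (c - c_hat[:, arm]) / n[arm]

    # Dual step: each queue grows with the observed violation of its budget.
    Q = np.maximum(Q + eta * (c - cost_budget), 0.0)

print("average per-round reward:", total_reward / T)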
Continuous-Time Multi-Armed Bandits with Controlled Restarts
TLDR
This work investigates the bandit problem with controlled restarts for time-constrained decision processes, and develops provably efficient online learning algorithms for both finite and continuous action spaces of restart strategies.
An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints
… is the number of constraints, d is the dimension of the reward feature space, and δ is Slater's constant; and zero constraint violation in any round τ > τ′, where τ′ is independent of the horizon T.
Making the most of your day: online learning for optimal allocation of time
TLDR
Studies online learning for optimal allocation when the resource to be allocated is time, analyzing the regret incurred by the agent first when she knows her reward function but does not know the distribution of the task duration, and then when she does not know her reward function either.
Fast and Accurate Online Decision-Making
We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive…
POND: Pessimistic-Optimistic oNline Dispatch
TLDR
A novel online dispatch algorithm, named POND, standing for Pessimistic-Optimistic oNline Dispatch, which provably bounds both regret and constraint violation; experiments show that POND achieves low regret with minimal constraint violations.
A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback
TLDR
A novel low-complexity algorithm based on Lyapunov optimization methodology, named LyOn, is proposed and it is proved that it achieves $O(\sqrt{B \log B})$ regret and $O(\log B / B)$ constraint violation.
Group-Fair Online Allocation in Continuous Time
TLDR
This work proposes a novel online learning algorithm based on dual ascent optimization for time averages, and proves that it achieves an $\tilde{O}(B^{-1/2})$ regret bound.

References

SHOWING 1-10 OF 37 REFERENCES
Multi-Armed Bandit with Budget Constraint and Variable Costs
TLDR
It is shown that when applying the proposed algorithms to a previous setting with fixed costs, one can improve the previously obtained regret bound, and results on real-time bidding in ad exchange verify the effectiveness of the algorithms and are consistent with the theoretical analysis.
Bandits with Budgets: Regret Lower Bounds and Optimal Algorithms
TLDR
Numerical experiments suggest that B-KL-UCB has the same or better finite-time performance when compared to various previously proposed (UCB-like) algorithms, which is important when applying such algorithms to a real-world problem.
Thompson Sampling for Budgeted Multi-Armed Bandits
TLDR
This paper extends Thompson sampling to Budgeted MAB, where there is a random cost for pulling an arm and the total cost is constrained by a budget, and proves that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where B denotes the budget.
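For intuition, here is a hypothetical sketch of the budgeted Thompson-sampling idea described above, restricted to Bernoulli rewards and costs with Beta posteriors: sample a plausible reward rate and cost rate for each arm, pull the arm with the best sampled reward-to-cost ratio, and stop when the budget runs out. The Beta(1, 1) priors, the stopping rule, and the Bernoulli model are simplifying assumptions of this sketch rather than details taken from the paper.

import numpy as np

rng = np.random.default_rng(2)

# Toy instance with Bernoulli rewards and Bernoulli costs; this Bernoulli model
# and the Beta(1, 1) priors are simplifying assumptions of the sketch.
reward_p = np.array([0.5, 0.6, 0.4])
cost_p = np.array([0.6, 0.9, 0.3])
K = len(reward_p)

def budgeted_thompson(budget):
    r_a, r_b = np.ones(K), np.ones(K)   # Beta posterior over reward probabilities
    c_a, c_b = np.ones(K), np.ones(K)   # Beta posterior over cost probabilities
    spent, total_reward = 0, 0

    while spent < budget:
        # Sample plausible reward and cost rates, pull the best sampled ratio.
        theta_r = rng.beta(r_a, r_b)
        theta_c = rng.beta(c_a, c_b)
        arm = int(np.argmax(theta_r / theta_c))

        r = rng.binomial(1, reward_p[arm])   # observed reward
        c = rng.binomial(1, cost_p[arm])     # observed cost, charged to the budget

        # Conjugate posterior updates.
        r_a[arm] += r
        r_b[arm] += 1 - r
        c_a[arm] += c
        c_b[arm] += 1 - c

        spent += c
        total_reward += r
    return total_reward

print("total reward under budget 200:", budgeted_thompson(200))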
Linear Contextual Bandits with Knapsacks
TLDR
This work combines techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.
Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits
TLDR
Two pulling policies are developed, namely (i) KUBE and (ii) fractional KUBE, which is computationally less expensive; logarithmic upper bounds are proved for the regret of both policies, and these bounds are shown to be asymptotically optimal.
Budgeted Bandit Problems with Continuous Random Costs
TLDR
This work proposes an upper confidence bound based algorithm for multi-armed bandits and a confidence ball based algorithm for linear bandits, and proves logarithmic regret bounds for both algorithms.
Bandits with concave rewards and convex knapsacks
TLDR
A very general model for exploration-exploitation tradeoff which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon is considered.
Multi-armed Bandits with Metric Switching Costs
TLDR
A general duality-based framework is developed to provide the first O(1) approximation for metric switching costs, with the actual constants being quite small.
Multi-armed bandit problems with heavy-tailed reward distributions
  • K. Liu, Qing Zhao
  • 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2011
TLDR
An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies and it is shown that when the moment-generating functions of the arm reward distributions are properly bounded, the optimal logarithmic order of the regret can be achieved by DSEE.
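The DSEE idea summarized above interleaves deterministic exploration with exploitation; the hypothetical sketch below explores round-robin whenever fewer than w·log(t) exploration pulls have been made and otherwise exploits the empirical best arm. The constant w, the Student-t noise model, and the plain sample-mean estimator are assumptions of this illustration; the paper uses more careful conditions and estimators, particularly for heavy-tailed rewards.

import numpy as np

rng = np.random.default_rng(3)

ARM_MEANS = [0.2, 0.5, 0.35]   # unknown to the learner
K = len(ARM_MEANS)

def dsee(T, w=10.0):
    """Deterministic Sequencing of Exploration and Exploitation (sketch).

    Explore round-robin whenever fewer than w * log(t) exploration pulls have
    been made so far; otherwise exploit the arm with the best sample mean.
    The constant w is chosen arbitrarily for this illustration."""
    n = np.zeros(K)
    mean = np.zeros(K)
    explored = 0
    total = 0.0

    for t in range(1, T + 1):
        if explored < w * np.log(t + 1):
            arm = explored % K            # deterministic round-robin exploration
            explored += 1
        else:
            arm = int(np.argmax(mean))    # exploitation of the empirical best arm
        # Heavy-tailed (Student-t) noise is just a toy stand-in here.
        r = ARM_MEANS[arm] + 0.1 * rng.standard_t(df=3)
        n[arm] += 1
        mean[arm] += (r - mean[arm]) / n[arm]
        total += r
    return total

T = 20000
print("average per-round reward:", dsee(T) / T)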
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
TLDR
A variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms is considered, providing the first analysis of the expected regret for such algorithms.