Corpus ID: 236154865

Design of Experiments for Stochastic Contextual Linear Bandits

Andrea Zanette, Kefan Dong, Jonathan N. Lee, Emma Brunskill
In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired. In practice, there can be significant engineering overhead to deploy these algorithms, especially when the dataset is collected in a distributed fashion or when a human in the loop is needed to implement a different policy. Exploring with a single non-reactive policy is beneficial in such cases. Assuming some batch contexts are…
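The two-phase idea the abstract describes (collect data with one fixed, non-reactive policy, then estimate the reward parameter offline) can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the dimensions, the Gaussian contexts, the noise scale, and the uniform-over-arms exploration policy are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 20, 2000          # feature dimension, arms per round, exploration budget
theta = rng.normal(size=d)     # unknown reward parameter (known here only to simulate rewards)

def sample_context():
    """Draw K random arm feature vectors (a stand-in for stochastic contexts)."""
    return rng.normal(size=(K, d))

# Phase 1: non-reactive exploration. The policy (uniform over arms) never looks
# at past rewards, so the data can be collected in a distributed fashion or by
# a human in the loop without ever switching policies.
X, y = [], []
for _ in range(n):
    arms = sample_context()
    a = rng.integers(K)                              # fixed, data-independent choice
    X.append(arms[a])
    y.append(arms[a] @ theta + rng.normal(scale=0.1))
X, y = np.array(X), np.array(y)

# Phase 2: offline ridge regression, then act greedily on fresh contexts.
theta_hat = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)
arms = sample_context()
best = int(np.argmax(arms @ theta_hat))
print(np.linalg.norm(theta_hat - theta))  # estimation error shrinks as n grows
```

The point of the sketch is the separation of concerns: nothing in phase 1 depends on the rewards observed so far, which is exactly what makes a single non-reactive policy cheap to deploy.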


Beyond Ads: Sequential Decision-Making Algorithms in Law and Public Policy
This work highlights several applications of sequential decision-making algorithms in regulation and governance, and discusses areas for needed research to render such methods policy-compliant, more widely applicable, and effective in the public sector.
Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits
A lower bound theorem is proved that surprisingly shows the optimality of the authors' two-phase regret upper bound (up to logarithmic factors) in the full range of the problem parameters, therefore establishing the exact batch-regret tradeoff.
Safe Exploration for Efficient Policy Evaluation and Comparison
Both theoretical analysis and experiments support the usefulness of the proposed methods for safe data collection in bandit policy evaluation and comparison, along with an efficient algorithm for computing them.
A Deep Bayesian Bandits Approach for Anticancer Therapy: Exploration via Functional Prior
This work proposes a novel deep Bayesian bandits framework that uses functional prior to approximate posterior for drug response prediction based on multi-modal information consisting of genomic features and drug structure and shows that this approach outperforms several benchmarks in identifying optimal treatment for a given cell line.
A fully adaptive algorithm for pure exploration in linear bandits
This work proposes the first fully adaptive algorithm for pure exploration in linear bandits, the task of finding the arm with the largest expected reward, which depends linearly on an unknown parameter, and evaluates the performance of the methods in simulations based on both synthetic settings and real-world data.
Gamification of Pure Exploration for Linear Bandits
This work designs the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits, which naturally bypasses a pitfall caused by a simple but difficult instance that most prior algorithms had to be engineered to handle explicitly.
Sequential Batch Learning in Finite-Action Linear Contextual Bandits
This work establishes a regret lower bound and provides an algorithm whose regret upper bound nearly matches it, yielding a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.
Linear bandits with limited adaptivity and learning distributional optimal design
It is shown that, when the context vectors are adversarially chosen in d-dimensional linear contextual bandits, the learner needs O(d log d log T) policy switches to achieve the minimax-optimal regret, and this is optimal up to poly(log d, log log T) factors.
Practical Contextual Bandits with Regression Oracles
This work presents a new technique that has the empirical and computational advantages of realizability-based approaches combined with the flexibility of agnostic methods, and typically gives comparable or superior results.
Batched Learning in Generalized Linear Contextual Bandits With General Decision Sets
This letter provides a lower bound that characterizes the fundamental limit of performance in this setting and gives a UCB-based batched learning algorithm whose regret bound, obtained using a self-normalized martingale style analysis, nearly matches this lower bound.
Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation
A new algorithm, COPOE, is proposed that overcomes the sample complexity issue of PC-PG while retaining its robustness to model misspecification, and makes several important algorithmic enhancements, such as enabling data reuse, and uses more refined analysis techniques, which are expected to be more broadly applicable to designing new reinforcement learning algorithms.
Provably Efficient Q-Learning with Low Switching Cost
The main contribution, Q-learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret with local switching cost $O(H^3SA\log K)$ over K episodes, together with a lower bound of $\Omega(HSA)$ on the local switching cost of any no-regret algorithm.
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
This work considers the multi-armed bandit problem under the PAC (“probably approximately correct”) model and generalizes the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the arms are not.
Contextual Bandits with Linear Payoff Functions
An $O(\sqrt{Td\,\ln(KT\ln(T)/\delta)})$ regret bound is proved that holds with probability $1-\delta$ for the simplest known upper confidence bound algorithm for this problem.
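For contrast with the non-reactive design of the main paper, a generic LinUCB-style upper confidence bound rule, the reactive family several of the entries above analyze, can be sketched as follows. This is a minimal illustration under assumed settings (Gaussian contexts, noise scale 0.1, a fixed bonus multiplier alpha), not the exact procedure analyzed in any cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T, alpha = 4, 10, 3000, 1.0   # dimension, arms, horizon, bonus multiplier
theta = rng.normal(size=d)          # unknown reward parameter (used only to simulate rewards)

A = np.eye(d)                       # ridge-regularized Gram matrix
b = np.zeros(d)
regret = 0.0
for t in range(T):
    arms = rng.normal(size=(K, d))  # fresh stochastic contexts each round
    theta_hat = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    # UCB score: estimated reward plus a per-arm exploration bonus
    ucb = arms @ theta_hat + alpha * np.sqrt(np.sum((arms @ A_inv) * arms, axis=1))
    a = int(np.argmax(ucb))
    r = arms[a] @ theta + rng.normal(scale=0.1)
    A += np.outer(arms[a], arms[a])  # the policy reacts: every pull updates the rule
    b += r * arms[a]
    regret += np.max(arms @ theta) - arms[a] @ theta
print(regret / T)                    # per-round regret decays as T grows
```

The update of A and b after every pull is precisely the reactivity that the main paper's single non-reactive exploration policy is designed to avoid.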