Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection

Yining Wang, Yi Chen, Ethan X. Fang, Zhaoran Wang, Runze Li
We consider the stochastic contextual bandit problem under a high-dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting has essential applications in personalized recommendation, online advertising, and personalized medicine, but it is challenging because exploration and exploitation must be balanced. We propose doubly growing epochs and estimating the…
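The abstract is truncated, so the following is only an illustrative sketch of the epoch-based idea it describes, not the paper's actual algorithm: epochs of doubling length, with a sparse re-estimate of the reward parameter at the end of each epoch. Here a hard-thresholded ridge estimate stands in for best subset selection, and all function names (`sample_context`, `pull_arm`) are hypothetical.

```python
import numpy as np

def epoch_doubling_bandit(sample_context, pull_arm, d, horizon, k=5):
    """Greedy play over doubly growing epochs (illustrative sketch).

    At the end of each epoch we refit a ridge estimate from all data so
    far and keep only the k largest coefficients, a crude stand-in for
    best subset selection.  Epoch lengths double, so the estimate is
    refreshed O(log T) times while exploitation dominates the horizon.
    """
    X, y = [], []
    theta_hat = np.zeros(d)
    t, epoch_len = 0, 2
    while t < horizon:
        for _ in range(min(epoch_len, horizon - t)):
            contexts = sample_context()        # shape (n_arms, d)
            arm = int(np.argmax(contexts @ theta_hat))
            reward = pull_arm(arm, contexts)
            X.append(contexts[arm]); y.append(reward)
            t += 1
        # Ridge estimate, then hard-threshold to the k largest entries.
        A = np.array(X); b = np.array(y)
        full = np.linalg.solve(A.T @ A + np.eye(d), A.T @ b)
        support = np.argsort(np.abs(full))[-k:]
        theta_hat = np.zeros(d)
        theta_hat[support] = full[support]
        epoch_len *= 2
    return theta_hat
```

The doubling schedule means only a logarithmic number of (expensive) subset-selection refits are needed over the horizon.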


Online Sparse Reinforcement Learning
A lower bound is provided showing that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys nearly dimension-free regret; in the large-action setting, the difficulty of learning can thus be attributed to the difficulty of finding a good exploratory policy.
Variance-Aware Sparse Linear Bandits
This paper presents the first variance-aware regret guarantee for sparse linear bandits, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step; the bound naturally interpolates between the worst-case constant-variance regime and the benign deterministic regime.
Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
A model-based algorithm is presented, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class.
Information Directed Sampling for Sparse Linear Bandits
This work explores the use of information-directed sampling (IDS), which naturally balances the information-regret trade-off, and develops a class of information-theoretic Bayesian regret bounds that nearly match existing lower bounds on a variety of problem instances.
Contextual Information-Directed Sampling
The advantage of contextual IDS over conditional IDS is provably demonstrated, emphasizing the importance of considering the context distribution; the main message is that an intelligent agent should invest more in actions that are beneficial for future unseen contexts, whereas conditional IDS can be myopic.
Stochastic Zeroth-Order Optimization under Nonstationarity and Nonconvexity
This work proposes and analyzes stochastic zeroth-order optimization algorithms for objectives that change with time, introduces nonstationary versions of regret measures based on second-order optimal solutions, and provides the corresponding regret bounds.


Online Sparse Linear Regression
This work considers the online sparse linear regression problem, in which the learner sequentially makes predictions while observing only a limited number of features in each round, aiming to minimize regret with respect to the best sparse linear regressor, and gives an inefficient algorithm that obtains regret bounded by $\tilde{O}(\sqrt{T})$ after $T$ prediction rounds.
Provably Optimal Algorithms for Generalized Linear Contextual Bandits
This work proposes an upper confidence bound based algorithm for generalized linear contextual bandits, which achieves $\tilde{O}(\sqrt{dT})$ regret over $T$ rounds with $d$-dimensional feature vectors, and proves this regret to be optimal in certain cases.
Contextual Gaussian Process Bandit Optimization
This work models the payoff function as a sample from a Gaussian process defined over the joint context-action space and develops CGP-UCB, an intuitive upper-confidence-style algorithm, showing that context-sensitive optimization outperforms no or naive use of context.
Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
The notion of sparsity regret bound is introduced, a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario; such a bound is proved for an online-learning algorithm called SeqSEW, based on exponential weighting and data-driven truncation.
Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits
The sparse variant of linear stochastic bandits is introduced, and it is shown that a recent online algorithm together with the online-to-confidence-set conversion yields algorithms that can exploit sparsity when the reward is a function of a sparse linear combination of the components of the chosen action.
Stochastic Linear Optimization under Bandit Feedback
A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented.
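The "upper confidence bounds" idea referenced by this entry, extended to the linear setting covered by the surrounding papers, can be sketched as follows. This is a generic LinUCB/OFUL-style sketch under standard assumptions (fixed finite action set, sub-Gaussian noise), not the algorithm of any one paper listed here; `pull` and the constant `alpha` are hypothetical.

```python
import numpy as np

def lin_ucb(arms, pull, horizon, alpha=1.0, lam=1.0):
    """Upper-confidence-bound play for stochastic linear bandits (sketch).

    `arms` is an (n_arms, d) array of fixed action vectors; `pull(i)`
    returns a noisy reward x_i^T theta + noise.  Each round picks the arm
    maximizing the point estimate plus an ellipsoidal confidence width
    alpha * sqrt(x^T V^{-1} x) -- the classic optimism-in-the-face-of-
    uncertainty rule.
    """
    n_arms, d = arms.shape
    V = lam * np.eye(d)            # regularized Gram matrix
    b = np.zeros(d)                # running sum of reward-weighted actions
    rewards = []
    for _ in range(horizon):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        i = int(np.argmax(arms @ theta_hat + alpha * widths))
        r = pull(i)
        V += np.outer(arms[i], arms[i])
        b += r * arms[i]
        rewards.append(r)
    return np.array(rewards)
```

As the Gram matrix $V$ grows, the confidence widths shrink and play concentrates on the optimal arm.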
Linearly Parameterized Bandits
It is proved that the regret and Bayes risk are of order $\Theta(r\sqrt{T})$, by establishing a lower bound for an arbitrary policy and showing that a matching upper bound is achieved by a policy that alternates between exploration and exploitation phases.
A contextual-bandit approach to personalized news article recommendation
This work models personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $\ell_1$-Constrained Quadratic Programming (Lasso)
  • M. Wainwright
  • Computer Science
    IEEE Transactions on Information Theory
  • 2009
This work analyzes the behavior of $\ell_1$-constrained quadratic programming (QP), also referred to as the Lasso, for recovering the sparsity pattern of a vector $\beta^*$ from observations contaminated by noise, and establishes precise conditions on the problem dimension $p$, the number $k$ of nonzero elements in $\beta^*$, and the number of observations $n$.
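The sparsity-pattern recovery this entry analyzes can be demonstrated with a minimal Lasso solver. The sketch below uses plain proximal gradient descent (ISTA), chosen here only for self-containedness; the paper's analysis applies to the Lasso optimum itself, not to this particular solver, and the regularization level in the test is an illustrative choice.

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=500):
    """Solve the Lasso  min_b 0.5*||Ab - y||^2 + lam*||b||_1  by ISTA.

    Iterative soft-thresholding: a gradient step on the quadratic term,
    then the soft-threshold proximal map of the l1 penalty, which sets
    small coordinates exactly to zero (hence recovers a sparsity pattern).
    Step size 1/L with L the squared spectral norm of A.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    b = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = b - step * A.T @ (A @ b - y)                          # gradient step
        b = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold
    return b
```

With $n$ large enough relative to $k \log p$, as characterized in the entry above, the nonzero pattern of the output matches the support of $\beta^*$.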
Online Decision-Making with High-Dimensional Covariates
This work formulates the problem as a multi-armed bandit with high-dimensional covariates and presents a new efficient bandit algorithm based on the LASSO estimator that outperforms existing bandit methods, as well as physicians, at correctly dosing a majority of patients.