# Reinforcement Learning with Immediate Rewards and Linear Hypotheses

```bibtex
@article{Abe2003ReinforcementL,
  title   = {Reinforcement Learning with Immediate Rewards and Linear Hypotheses},
  author  = {N. Abe and A. Biermann and Philip M. Long},
  journal = {Algorithmica},
  year    = {2003},
  volume  = {37},
  pages   = {263--293}
}
```

#### Abstract

We consider the design and analysis of algorithms that learn from the consequences of their actions with the goal of maximizing their cumulative reward, when the consequence of a given action is felt immediately, and a linear function, which is unknown a priori, (approximately) relates a feature vector for each action/state pair to the (expected) associated reward. We focus on two cases, one in which a continuous-valued reward is (approximately) given by applying the unknown linear…
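The setting the abstract describes — an unknown linear map from action features to expected reward, observed through immediate noisy rewards — can be sketched as follows. This is an illustrative ridge-regression strategy with an optimism bonus, not the paper's own algorithm; the noise model, bonus scale, and all variable names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_actions, T = 5, 10, 2000
theta_true = rng.normal(size=d)             # unknown linear hypothesis
features = rng.normal(size=(n_actions, d))  # one feature vector per action

A = np.eye(d)      # regularized Gram matrix (ridge prior)
b = np.zeros(d)    # accumulated reward-weighted features
total_reward = 0.0

for t in range(T):
    theta_hat = np.linalg.solve(A, b)       # ridge estimate of theta
    # optimistic score: predicted reward plus an uncertainty bonus
    A_inv = np.linalg.inv(A)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
    a = int(np.argmax(features @ theta_hat + 0.5 * bonus))
    x = features[a]
    r = x @ theta_true + 0.1 * rng.normal()  # immediate noisy linear reward
    A += np.outer(x, x)
    b += r * x
    total_reward += r

best = float(np.max(features @ theta_true))
print(total_reward / T, best)  # average reward should approach the best action's mean
```

The per-round average reward converges toward the mean reward of the best fixed action; the gap between the two is the (per-round) regret that the paper's analysis bounds.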


#### 76 Citations

Orthogonal Projection in Linear Bandits

- Mathematics, Computer Science
- 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
- 2019

This paper considers the case where the expected reward is an unknown linear function of a projection of the decision vector onto a subspace orthogonal to the first, and develops a strategy to achieve O(log T) regret, where T is the number of time steps.

Contextual Markov Decision Processes using Generalized Linear Models

- Computer Science, Mathematics
- ArXiv
- 2019

This paper proposes a no-regret online RL algorithm in the setting where the MDP parameters are obtained from the context using generalized linear models (GLMs); it relies on efficient online updates and is also memory efficient.

Efficient Value-Function Approximation via Online Linear Regression

- Computer Science
- ISAIM
- 2008

A provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation tradeoff in a principled way.

A unifying framework for computational reinforcement learning theory

- Computer Science
- 2009

This thesis argues that the KWIK learning model provides a flexible, modularized, and unifying way to create and analyze reinforcement-learning algorithms with provably efficient exploration, and that it facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning in real-world problems.

On-Line Adaptation of Exploration in the One-Armed Bandit with Covariates Problem

- Computer Science
- 2010 Ninth International Conference on Machine Learning and Applications
- 2010

This paper introduces a novel algorithm, e-ADAPT, which adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system.

Parametrized stochastic multi-armed bandits with binary rewards

- Mathematics, Computer Science
- Proceedings of the 2011 American Control Conference
- 2011

An upper bound on the total regret which applies uniformly in time is shown, establishing that for any f ∈ ω(log T), the total regret can be made O(n·f(T)), independent of the number of arms.

Upper Confidence Bound-Based Exploration

- 2019

We study the stochastic contextual bandit problem, where the reward is generated from an unknown bounded function with additive noise. We propose the NeuralUCB algorithm, which leverages the…

New models and algorithms for bandits and markets

- Computer Science
- 2015

This dissertation develops a theory for bandit problems with structured rewards that permit a graphical-model representation, gives an efficient algorithm for regret minimization in that setting, and develops a deeper connection between online supervised learning and regret minimization.

Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

- Computer Science, Mathematics
- COLT
- 2021

A family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits is introduced, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.

Multi-Armed Bandits with Censored Consumption of Resources

- Computer Science, Mathematics
- ArXiv
- 2020

A measure of regret is introduced that incorporates the actual amount of allocated resources in each learning round as well as the optimality of realizable rewards; a lower bound on the cumulative regret is derived, and a learning algorithm is proposed whose regret upper bound matches the lower bound.

#### References

Showing 1-10 of 27 references

Associative Reinforcement Learning using Linear Probabilistic Concepts

- Computer Science
- ICML
- 1999

The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number m of trials and the number n of alternatives like O(m^{3/4} n^{1/2}) and O(m^{4/5} n^{2/5}), and the lower bound is…

Reinforcement Learning: An Introduction

- Computer Science
- IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Associative reinforcement learning: A generate and test algorithm

- Machine Learning
- 2004

An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for…

Associative Reinforcement Learning: Functions in k-DNF

- Mathematics, Computer Science
- Machine Learning
- 2004

Algorithms that can efficiently learn action maps that are expressible in k-DNF are developed and are shown to have very good performance.

On-line evaluation and prediction using linear functions

- Mathematics, Computer Science
- COLT '97
- 1997

A model for situations where an algorithm needs to make a sequence of choices to minimize an evaluation function, but where the evaluation function must be learned on-line as it is being used, and proves performance bounds for them that hold in the worst case.

Individual sequence prediction - upper bounds and application for complexity

- Mathematics, Computer Science
- COLT '99
- 1999

This work presents the first upper bound on the regret of the loss game that is a function of…

Using Confidence Bounds for Exploitation-Exploration Trade-offs

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2002

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^{3/4}) to O(T^{1/2}).
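The confidence-bound idea this reference describes — score each action by its empirical mean plus an uncertainty term, so under-explored actions look optimistic — can be sketched in its simplest multi-armed form. This is a generic UCB1-style sketch on hypothetical Bernoulli arms, not the algorithm of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli arm means
K, T = len(means), 5000
counts = np.zeros(K)                # pulls per arm
sums = np.zeros(K)                  # total reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                   # pull each arm once to initialize
    else:
        # empirical mean plus confidence radius: optimism under uncertainty
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = float(rng.random() < means[a])
    counts[a] += 1
    sums[a] += r

print(counts)  # most pulls should concentrate on the best arm (mean 0.8)
```

Because the confidence radius shrinks like sqrt(log t / counts), suboptimal arms are pulled only O(log T) times, which is the mechanism behind the improved regret rates mentioned above.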

Using upper confidence bounds for online learning

- Computer Science
- Proceedings 41st Annual Symposium on Foundations of Computer Science
- 2000

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off and extends the results for the adversarial bandit problem to shifting bandits.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

- Computer Science
- Machine Learning
- 2004

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. The algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.

Worst-case quadratic loss bounds for prediction using linear functions and gradient descent

- Mathematics, Computer Science
- IEEE Trans. Neural Networks
- 1996

Studies the performance of gradient descent (GD) when applied to the problem of online linear prediction in arbitrary inner product spaces. We prove worst-case bounds on the sum of the squared…