Reinforcement Learning with Immediate Rewards and Linear Hypotheses

@article{Abe2003ReinforcementL,
  title={Reinforcement Learning with Immediate Rewards and Linear Hypotheses},
  author={N. Abe and A. Biermann and Philip M. Long},
  journal={Algorithmica},
  year={2003},
  volume={37},
  pages={263--293}
}
Abstract: We consider the design and analysis of algorithms that learn from the consequences of their actions with the goal of maximizing their cumulative reward, when the consequence of a given action is felt immediately, and a linear function, which is unknown a priori, (approximately) relates a feature vector for each action/state pair to the (expected) associated reward. We focus on two cases, one in which a continuous-valued reward is (approximately) given by applying the unknown linear…
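The setting described in the abstract is, in modern terms, a linear contextual bandit with immediate rewards. The following is a purely illustrative sketch of that setting: the class name, the ridge-regularized least-squares update, and the greedy action rule are assumptions of this example, not the algorithms analyzed in the paper.

```python
import numpy as np

class LinearRewardLearner:
    """Minimal sketch of the immediate-reward setting: each action has a
    feature vector, the expected reward is (approximately) a fixed but
    unknown linear function of that vector, and the learner refreshes a
    ridge-regression estimate of the weights after every observed reward.
    Illustrative baseline only, not the paper's algorithm."""

    def __init__(self, dim, reg=1.0):
        self.A = reg * np.eye(dim)   # regularized Gram matrix: sum of x x^T plus reg*I
        self.b = np.zeros(dim)       # sum of reward-weighted feature vectors
        self.w = np.zeros(dim)       # current estimate of the unknown linear function

    def choose(self, features):
        """Pick the action whose estimated expected reward is largest.
        `features` is an (n_actions, dim) array of feature vectors."""
        return int(np.argmax(features @ self.w))

    def update(self, x, reward):
        """Incorporate one observed (feature vector, immediate reward) pair."""
        self.A += np.outer(x, x)
        self.b += reward * x
        self.w = np.linalg.solve(self.A, self.b)
```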
Orthogonal Projection in Linear Bandits
  • Qiyu Kang, Wee Peng Tay
  • Mathematics, Computer Science
  • 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
  • 2019
TLDR
This paper considers the case where the expected reward is an unknown linear function of a projection of the decision vector onto a subspace orthogonal to the first, and develops a strategy to achieve O(log T) regret, where T is the number of time steps.
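To make the projection step concrete, here is a minimal sketch of projecting a decision vector onto the subspace orthogonal to a known subspace. The function name and the matrix U holding a basis of that subspace are illustrative assumptions; the paper's bandit strategy and regret analysis are not reproduced here.

```python
import numpy as np

def project_onto_orthogonal_complement(x, U):
    """Project decision vector x onto the subspace orthogonal to span(U).
    U is a (dim, k) matrix whose columns span the known subspace; the
    projector onto its orthogonal complement is I - U (U^T U)^{-1} U^T."""
    P = U @ np.linalg.solve(U.T @ U, U.T)   # projector onto span(U)
    return x - P @ x
```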
Contextual Markov Decision Processes using Generalized Linear Models
TLDR
This paper proposes a no-regret online RL algorithm for the setting where the MDP parameters are obtained from the context using generalized linear models (GLMs); the algorithm relies on efficient online updates and is also memory efficient.
Efficient Value-Function Approximation via Online Linear Regression
TLDR
A provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation tradeoff in a principled way.
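For reference, "linear value-function approximation" means representing V(s) as a dot product of a weight vector with a feature vector of the state. The sketch below is a generic semi-gradient TD(0) update of that weight vector, under assumed step-size and discount parameters; it is a textbook illustration, not the finite-horizon algorithm proposed in the paper.

```python
import numpy as np

def td0_linear_update(w, phi_s, reward, phi_s_next,
                      alpha=0.1, gamma=0.99, terminal=False):
    """One semi-gradient TD(0) step for a linear value function V(s) ~ w . phi(s)."""
    target = reward + (0.0 if terminal else gamma * (w @ phi_s_next))
    td_error = target - w @ phi_s
    return w + alpha * td_error * phi_s   # move weights toward the bootstrapped target
```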
A unifying framework for computational reinforcement learning theory
TLDR
The thesis is that the KWIK learning model provides a flexible, modularized, and unifying way to create and analyze reinforcement-learning algorithms with provably efficient exploration, and that it facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning in real-world problems.
On-Line Adaptation of Exploration in the One-Armed Bandit with Covariates Problem
TLDR
This paper introduces a novel algorithm, e-ADAPT, which adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system.
Parametrized stochastic multi-armed bandits with binary rewards
  • Chong Jiang, R. Srikant
  • Mathematics, Computer Science
  • Proceedings of the 2011 American Control Conference
  • 2011
TLDR
An upper bound on the total regret is shown that applies uniformly in time; for any f ∈ ω(log T), the total regret can be made O(n·f(T)), independent of the number of arms.
Upper Confidence Bound-Based Exploration
  • 2019
We study the stochastic contextual bandit problem, where the reward is generated from an unknown bounded function with additive noise. We propose the NeuralUCB algorithm, which leverages the…
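As background for the heading, here is a minimal sketch of upper confidence bound-based exploration in its simplest non-contextual form (standard UCB1). The `pull` callback and horizon argument are assumptions of this example; NeuralUCB's neural-network-based confidence bounds are not shown.

```python
import math

def ucb1(pull, n_arms, horizon):
    """Generic UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i).  `pull(arm)` should return a
    stochastic reward in [0, 1] for the chosen arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                         # initialization: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(arm)                           # observe an immediate stochastic reward
        counts[arm] += 1
        sums[arm] += r
    return sums, counts
```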
New models and algorithms for bandits and markets
TLDR
This dissertation develops a theory for bandit problems with structured rewards that permit a graphical model representation, gives an efficient algorithm for regret minimization in such a setting, and develops a deeper connection between online supervised learning and regret minimization.
Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
TLDR
A family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits is introduced, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.
Multi-Armed Bandits with Censored Consumption of Resources
TLDR
A measure of regret is introduced that incorporates the actual amount of resources allocated in each learning round as well as the optimality of realizable rewards; a lower bound on the cumulative regret is derived, and a learning algorithm is proposed whose regret upper bound matches the lower bound.

References

Showing 1-10 of 27 references
Associative Reinforcement Learning using Linear Probabilistic Concepts
TLDR
The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number m of trials and the number n of alternatives like O(m^{3/4} n^{1/2}) and O(m^{4/5} n^{2/5}), and the lower bound is…
Reinforcement Learning: An Introduction
TLDR
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Associative reinforcement learning: A generate and test algorithm
An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for…
Associative Reinforcement Learning: Functions in k-DNF
  • L. Kaelbling
  • Mathematics, Computer Science
  • Machine Learning
  • 2004
TLDR
Algorithms that can efficiently learn action maps expressible in k-DNF are developed and are shown to have very good performance.
On-line evaluation and prediction using linear functions
TLDR
A model is presented for situations where an algorithm needs to make a sequence of choices to minimize an evaluation function, but where the evaluation function must be learned on-line as it is being used, and performance bounds that hold in the worst case are proved for the proposed methods.
Individual sequence prediction—upper bounds and application for complexity
TLDR
This work presents the first upper bound on the regret of the loss game that is a function of…
Using Confidence Bounds for Exploitation-Exploration Trade-offs
  • P. Auer
  • Mathematics, Computer Science
  • J. Mach. Learn. Res.
  • 2002
TLDR
It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^{3/4}) to O(T^{1/2}).
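A simplified illustration of using confidence bounds for the exploitation-exploration trade-off in a linear-reward setting is sketched below. It follows the now-common LinUCB-style rule, which differs in its details from the LinRel-type algorithm analyzed by Auer, so it should be read as an illustration of the confidence-bound idea rather than the paper's method; the function name, `alpha`, and the ridge statistics A and b are assumptions of the example.

```python
import numpy as np

def linucb_choose(A, b, features, alpha=1.0):
    """Pick an action by an optimistic (upper-confidence) reward estimate.
    A = reg*I + sum of x x^T and b = sum of r*x are ridge-regression
    statistics; `features` is an (n_actions, dim) array."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                               # point estimate of the linear reward
    widths = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
    scores = features @ theta + alpha * widths      # mean estimate plus confidence width
    return int(np.argmax(scores))
```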
Using upper confidence bounds for online learning
  • P. Auer
  • Computer Science
  • Proceedings 41st Annual Symposium on Foundations of Computer Science
  • 2000
TLDR
It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off, and the results for the adversarial bandit problem are extended to shifting bandits.
Simple statistical gradient-following algorithms for connectionist reinforcement learning
TLDR
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units that are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
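To illustrate the kind of update described here, a single Bernoulli-logistic unit receiving an immediate reward can be trained with Williams' REINFORCE rule, delta_w = alpha * (r - b) * (y - p) * x. The sketch below is a single-unit illustration; the `reward_fn` callback, step size, and baseline value are assumptions of the example, and the full class of network algorithms treated in the article is not shown.

```python
import numpy as np

def reinforce_bernoulli_trial(w, x, reward_fn, baseline=0.0, alpha=0.1, rng=np.random):
    """One immediate-reinforcement trial for a Bernoulli-logistic unit."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # probability the unit outputs y = 1
    y = float(rng.random() < p)          # stochastic output of the unit
    r = reward_fn(y)                     # immediate reinforcement from the environment
    # Weight change lies along the gradient of expected reinforcement,
    # without ever forming an explicit gradient estimate.
    return w + alpha * (r - baseline) * (y - p) * x
```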
Worst-case quadratic loss bounds for prediction using linear functions and gradient descent
This paper studies the performance of gradient descent (GD) applied to the problem of on-line linear prediction in arbitrary inner product spaces and proves worst-case bounds on the sum of the squared…
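The protocol behind these bounds is plain online gradient descent for linear prediction under squared loss. A minimal sketch follows; the learning rate and function signature are assumptions of the example, and the paper's worst-case analysis places specific conditions on the step size that are not reflected here.

```python
import numpy as np

def online_gd_linear(examples, dim, eta=0.1):
    """Online gradient descent for linear prediction with squared loss:
    predict y_hat = w . x, suffer (y_hat - y)^2, then update the weights."""
    w = np.zeros(dim)
    total_loss = 0.0
    for x, y in examples:                 # examples arrive one at a time
        y_hat = w @ x
        total_loss += (y_hat - y) ** 2
        w -= eta * (y_hat - y) * x        # gradient of the squared loss (up to a factor of 2)
    return w, total_loss
```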