# Reinforcement Learning with Immediate Rewards and Linear Hypotheses

```bibtex
@article{Abe2003ReinforcementL,
  title   = {Reinforcement Learning with Immediate Rewards and Linear Hypotheses},
  author  = {N. Abe and A. Biermann and Philip M. Long},
  journal = {Algorithmica},
  year    = {2003},
  volume  = {37},
  pages   = {263--293}
}
```

#### Abstract

We consider the design and analysis of algorithms that learn from the consequences of their actions with the goal of maximizing their cumulative reward, when the consequence of a given action is felt immediately, and a linear function, which is unknown a priori, (approximately) relates a feature vector for each action/state pair to the (expected) associated reward. We focus on two cases, one in which a continuous-valued reward is (approximately) given by applying the unknown linear…
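The setting the abstract describes — an unknown linear map from action features to expected reward, observed through immediate noisy rewards — can be sketched as follows. This is an illustrative ridge-regression strategy with an optimism bonus, not the paper's own algorithm; the noise model, bonus scale, and all variable names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_actions, T = 5, 10, 2000
theta_true = rng.normal(size=d)             # unknown linear hypothesis
features = rng.normal(size=(n_actions, d))  # one feature vector per action

A = np.eye(d)      # regularized Gram matrix (ridge prior)
b = np.zeros(d)    # accumulated reward-weighted features
total_reward = 0.0

for t in range(T):
    theta_hat = np.linalg.solve(A, b)       # ridge estimate of theta
    # optimistic score: predicted reward plus an uncertainty bonus
    A_inv = np.linalg.inv(A)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
    a = int(np.argmax(features @ theta_hat + 0.5 * bonus))
    x = features[a]
    r = x @ theta_true + 0.1 * rng.normal()  # immediate noisy linear reward
    A += np.outer(x, x)
    b += r * x
    total_reward += r

best = float(np.max(features @ theta_true))
print(total_reward / T, best)  # average reward should approach the best action's mean
```

The per-round average reward converges toward the mean reward of the best fixed action; the gap between the two is the (per-round) regret that the paper's analysis bounds.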


#### 76 Citations

Orthogonal Projection in Linear Bandits

- Mathematics, Computer Science
- 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
- 2019

This paper considers the case where the expected reward is an unknown linear function of a projection of the decision vector onto a subspace orthogonal to the first, and develops a strategy to achieve O(log T) regret, where T is the number of time steps.

Contextual Markov Decision Processes using Generalized Linear Models

- Computer Science, Mathematics
- ArXiv
- 2019

This paper proposes a no-regret online RL algorithm in the setting where the MDP parameters are obtained from the context using generalized linear models (GLMs); it relies on efficient online updates and is also memory efficient.

Efficient Value-Function Approximation via Online Linear Regression

- Computer Science
- ISAIM
- 2008

A provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation tradeoff in a principled way.

A unifying framework for computational reinforcement learning theory

- Computer Science
- 2009

This thesis argues that the KWIK learning model provides a flexible, modularized, and unifying way to create and analyze reinforcement-learning algorithms with provably efficient exploration, and that it facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning in real-world problems.

On-Line Adaptation of Exploration in the One-Armed Bandit with Covariates Problem

- Computer Science
- 2010 Ninth International Conference on Machine Learning and Applications
- 2010

This paper introduces a novel algorithm, e-ADAPT, which adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system.

Parametrized stochastic multi-armed bandits with binary rewards

- Mathematics, Computer Science
- Proceedings of the 2011 American Control Conference
- 2011

An upper bound on the total regret which applies uniformly in time is shown, establishing that for any f ∈ ω(log T), the total regret can be made O(n·f(T)), independent of the number of arms.

Upper Confidence Bound-Based Exploration

- 2019

We study the stochastic contextual bandit problem, where the reward is generated from an unknown bounded function with additive noise. We propose the NeuralUCB algorithm, which leverages the…

New models and algorithms for bandits and markets

- Computer Science
- 2015

This dissertation develops a theory for bandit problems with structured rewards that permit a graphical-model representation, gives an efficient algorithm for regret minimization in that setting, and develops a deeper connection between online supervised learning and regret minimization.

Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

- Computer Science, Mathematics
- COLT
- 2021

A family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits is introduced, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.

Multi-Armed Bandits with Censored Consumption of Resources

- Computer Science, Mathematics
- ArXiv
- 2020

A measure of regret is introduced that incorporates the actual amount of allocated resources in each learning round as well as the optimality of realizable rewards; a lower bound on the cumulative regret is derived, and a learning algorithm is proposed whose regret upper bound matches the lower bound.

#### References

Showing 1-10 of 27 references

Associative Reinforcement Learning using Linear Probabilistic Concepts

- Computer Science
- ICML
- 1999

The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number m of trials and the number n of alternatives like O(m^{3/4} n^{1/2}) and O(m^{4/5} n^{2/5}), and the lower bound is…

Reinforcement Learning: An Introduction

- Computer Science
- IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Associative reinforcement learning: A generate and test algorithm

- Machine Learning
- 2004

An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for…

Associative Reinforcement Learning: Functions in k-DNF

- Mathematics, Computer Science
- Machine Learning
- 2004

Algorithms that can efficiently learn action maps that are expressible in k-DNF are developed and are shown to have very good performance.

On-line evaluation and prediction using linear functions

- Mathematics, Computer Science
- COLT '97
- 1997

A model for situations where an algorithm needs to make a sequence of choices to minimize an evaluation function, but where the evaluation function must be learned on-line as it is being used, and proves performance bounds for them that hold in the worst case.

Individual sequence prediction - upper bounds and application for complexity

- Mathematics, Computer Science
- COLT '99
- 1999

This work presents the first upper bound on the regret of the loss game that is a function of…

Using Confidence Bounds for Exploitation-Exploration Trade-offs

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2002

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^{3/4}) to O(T^{1/2}).
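The confidence-bound idea this reference describes — score each action by its empirical mean plus an uncertainty term, so under-explored actions look optimistic — can be sketched in its simplest multi-armed form. This is a generic UCB1-style sketch on hypothetical Bernoulli arms, not the algorithm of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli arm means
K, T = len(means), 5000
counts = np.zeros(K)                # pulls per arm
sums = np.zeros(K)                  # total reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                   # pull each arm once to initialize
    else:
        # empirical mean plus confidence radius: optimism under uncertainty
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = float(rng.random() < means[a])
    counts[a] += 1
    sums[a] += r

print(counts)  # most pulls should concentrate on the best arm (mean 0.8)
```

Because the confidence radius shrinks like sqrt(log t / counts), suboptimal arms are pulled only O(log T) times, which is the mechanism behind the improved regret rates mentioned above.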

Using upper confidence bounds for online learning

- Computer Science
- Proceedings 41st Annual Symposium on Foundations of Computer Science
- 2000

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off and extends the results for the adversarial bandit problem to shifting bandits.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

- Computer Science
- Machine Learning
- 2004

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. The algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.

Worst-case quadratic loss bounds for prediction using linear functions and gradient descent

- Mathematics, Computer Science
- IEEE Trans. Neural Networks
- 1996

Studies the performance of gradient descent (GD) when applied to the problem of online linear prediction in arbitrary inner product spaces. We prove worst-case bounds on the sum of the squared…