# Reinforcement Learning with Immediate Rewards and Linear Hypotheses

```bibtex
@article{Abe2003ReinforcementL,
  title   = {Reinforcement Learning with Immediate Rewards and Linear Hypotheses},
  author  = {N. Abe and A. Biermann and Philip M. Long},
  journal = {Algorithmica},
  year    = {2003},
  volume  = {37},
  pages   = {263--293}
}
```

#### Abstract

We consider the design and analysis of algorithms that learn from the consequences of their actions with the goal of maximizing their cumulative reward, when the consequence of a given action is felt immediately, and a linear function, which is unknown a priori, (approximately) relates a feature vector for each action/state pair to the (expected) associated reward. We focus on two cases, one in which a continuous-valued reward is (approximately) given by applying the unknown linear…
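The setting the abstract describes — an unknown linear map from action features to expected reward, observed through immediate noisy rewards — can be sketched as follows. This is an illustrative ridge-regression strategy with an optimism bonus, not the paper's own algorithm; the noise model, bonus scale, and all variable names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_actions, T = 5, 10, 2000
theta_true = rng.normal(size=d)             # unknown linear hypothesis
features = rng.normal(size=(n_actions, d))  # one feature vector per action

A = np.eye(d)      # regularized Gram matrix (ridge prior)
b = np.zeros(d)    # accumulated reward-weighted features
total_reward = 0.0

for t in range(T):
    theta_hat = np.linalg.solve(A, b)       # ridge estimate of theta
    # optimistic score: predicted reward plus an uncertainty bonus
    A_inv = np.linalg.inv(A)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', features, A_inv, features))
    a = int(np.argmax(features @ theta_hat + 0.5 * bonus))
    x = features[a]
    r = x @ theta_true + 0.1 * rng.normal()  # immediate noisy linear reward
    A += np.outer(x, x)
    b += r * x
    total_reward += r

best = float(np.max(features @ theta_true))
print(total_reward / T, best)  # average reward should approach the best action's mean
```

The per-round average reward converges toward the mean reward of the best fixed action; the gap between the two is the (per-round) regret that the paper's analysis bounds.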


#### 76 Citations

Orthogonal Projection in Linear Bandits

- Mathematics, Computer Science
- 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
- 2019

This paper considers the case where the expected reward is an unknown linear function of a projection of the decision vector onto a subspace orthogonal to the first, and develops a strategy to achieve O(log T) regret, where T is the number of time steps.

Contextual Markov Decision Processes using Generalized Linear Models

- Computer Science, Mathematics
- ArXiv
- 2019

This paper proposes a no-regret online RL algorithm in the setting where the MDP parameters are obtained from the context using generalized linear models (GLMs); it relies on efficient online updates and is also memory efficient.

Efficient Value-Function Approximation via Online Linear Regression

- Computer Science
- ISAIM
- 2008

A provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation tradeoff in a principled way.

A unifying framework for computational reinforcement learning theory

- Computer Science
- 2009

This thesis argues that the KWIK learning model provides a flexible, modularized, and unifying way to create and analyze reinforcement-learning algorithms with provably efficient exploration, and that it facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning in real-world problems.

On-Line Adaptation of Exploration in the One-Armed Bandit with Covariates Problem

- Computer Science
- 2010 Ninth International Conference on Machine Learning and Applications
- 2010

This paper introduces a novel algorithm, e-ADAPT, which adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system.

Parametrized stochastic multi-armed bandits with binary rewards

- Mathematics, Computer Science
- Proceedings of the 2011 American Control Conference
- 2011

An upper bound on the total regret which applies uniformly in time is shown, establishing that for any f ∈ ω(log T), the total regret can be made O(n·f(T)), independent of the number of arms.

Upper Confidence Bound-Based Exploration

- 2019

We study the stochastic contextual bandit problem, where the reward is generated from an unknown bounded function with additive noise. We propose the NeuralUCB algorithm, which leverages the…

New models and algorithms for bandits and markets

- Computer Science
- 2015

This dissertation develops a theory for bandit problems with structured rewards that permit a graphical-model representation, gives an efficient algorithm for regret minimization in that setting, and develops a deeper connection between online supervised learning and regret minimization.

Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

- Computer Science, Mathematics
- COLT
- 2021

A family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds for contextual bandits is introduced, along with new oracle-efficient algorithms that adapt to the gap whenever possible while also attaining the minimax rate in the worst case.

Multi-Armed Bandits with Censored Consumption of Resources

- Computer Science, Mathematics
- ArXiv
- 2020

A measure of regret is introduced that incorporates the actual amount of allocated resources in each learning round as well as the optimality of realizable rewards; a lower bound on the cumulative regret is derived, and a learning algorithm is proposed whose regret upper bound matches the lower bound.

#### References

Showing 1-10 of 27 references

Associative Reinforcement Learning using Linear Probabilistic Concepts

- Computer Science
- ICML
- 1999

The analysis shows that the worst-case (expected) regret for the methods is almost optimal: the upper bounds grow with the number m of trials and the number n of alternatives like O(m^{3/4} n^{1/2}) and O(m^{4/5} n^{2/5}), and the lower bound is…

Reinforcement Learning: An Introduction

- Computer Science
- IEEE Transactions on Neural Networks
- 2005

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Associative reinforcement learning: A generate and test algorithm

- Machine Learning
- 2004

An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for…

Associative Reinforcement Learning: Functions in k-DNF

- Mathematics, Computer Science
- Machine Learning
- 2004

Algorithms that can efficiently learn action maps that are expressible in k-DNF are developed and are shown to have very good performance.

On-line evaluation and prediction using linear functions

- Mathematics, Computer Science
- COLT '97
- 1997

A model for situations where an algorithm needs to make a sequence of choices to minimize an evaluation function, but where the evaluation function must be learned on-line as it is being used, and proves performance bounds for them that hold in the worst case.

Individual sequence prediction - upper bounds and application for complexity

- Mathematics, Computer Science
- COLT '99
- 1999

This work presents the first upper bound on the regret of the loss game that is a function of…

Using Confidence Bounds for Exploitation-Exploration Trade-offs

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2002

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^{3/4}) to O(T^{1/2}).
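The confidence-bound idea this reference describes — score each action by its empirical mean plus an uncertainty term, so under-explored actions look optimistic — can be sketched in its simplest multi-armed form. This is a generic UCB1-style sketch on hypothetical Bernoulli arms, not the algorithm of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli arm means
K, T = len(means), 5000
counts = np.zeros(K)                # pulls per arm
sums = np.zeros(K)                  # total reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                   # pull each arm once to initialize
    else:
        # empirical mean plus confidence radius: optimism under uncertainty
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = float(rng.random() < means[a])
    counts[a] += 1
    sums[a] += r

print(counts)  # most pulls should concentrate on the best arm (mean 0.8)
```

Because the confidence radius shrinks like sqrt(log t / counts), suboptimal arms are pulled only O(log T) times, which is the mechanism behind the improved regret rates mentioned above.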

Using upper confidence bounds for online learning

- Computer Science
- Proceedings 41st Annual Symposium on Foundations of Computer Science
- 2000

It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation/exploration trade-off and extends the results for the adversarial bandit problem to shifting bandits.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

- Computer Science
- Machine Learning
- 2004

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. The algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.

Worst-case quadratic loss bounds for prediction using linear functions and gradient descent

- Mathematics, Computer Science
- IEEE Trans. Neural Networks
- 1996

Studies the performance of gradient descent (GD) when applied to the problem of online linear prediction in arbitrary inner product spaces. We prove worst-case bounds on the sum of the squared…