# Policy evaluation with temporal differences: a survey and comparison

@article{Dann2014PolicyEW, title={Policy evaluation with temporal differences: a survey and comparison}, author={Christoph Dann and G. Neumann and Jan Peters}, journal={J. Mach. Learn. Res.}, year={2014}, volume={15}, pages={809-883} }

Extended abstract of the article: Christoph Dann, Gerhard Neumann, Jan Peters (2014). Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15, 809-883.
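As a minimal illustration of the class of methods the survey compares, the following is a tabular TD(0) policy-evaluation sketch. The two-state chain, function names, and step size here are hypothetical, not taken from the article:

```python
import numpy as np

def td0_policy_evaluation(sample_transition, n_states, gamma=0.9,
                          alpha=0.1, n_steps=10_000):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = np.zeros(n_states)
    s = 0
    for _ in range(n_steps):
        s_next, r = sample_transition(s)
        td_error = r + gamma * V[s_next] - V[s]  # one-step TD error
        V[s] += alpha * td_error
        s = s_next
    return V

# Hypothetical two-state chain: 0 -> 1 with reward 0, 1 -> 0 with reward 1.
def chain(s):
    return (1, 0.0) if s == 0 else (0, 1.0)

V = td0_policy_evaluation(chain, n_states=2)
# True values solve V(0) = gamma*V(1), V(1) = 1 + gamma*V(0).
```

For this deterministic chain the Bellman fixed point is V(1) = 1/(1 - γ²) ≈ 5.26 and V(0) = γ/(1 - γ²) ≈ 4.74, which the iteration approaches.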

#### 156 Citations

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

- Mathematics, Computer Science
- ArXiv
- 2020

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both…

Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2016

It is shown empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity and that a suitable representation of the value function also stabilizes the solution to some degree.

An Adaptive Sampling Algorithm for Policy Evaluation

- Computer Science
- 2019 Fifth Indian Control Conference (ICC)
- 2019

The empirical analysis shows that the algorithms converge to the neighbourhood of the fixed point of the projected Bellman equation faster than the respective state-of-the-art algorithms.

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.

Stochastic Variance Reduction Methods for Policy Evaluation

- Computer Science, Mathematics
- ICML
- 2017

This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.

Investigating Practical Linear Temporal Difference Learning

- Computer Science, Mathematics
- AAMAS
- 2016

This paper derives two new hybrid TD policy-evaluation algorithms that fill a gap in this collection of algorithms, performs an empirical comparison to determine which of these new linear TD methods should be preferred in different situations, and makes concrete suggestions about practical use.

Off-Policy Evaluation in Partially Observable Environments

- Computer Science, Engineering
- AAAI
- 2020

A model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP, is formulated; it is shown how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs.

Empirical Analysis of Off-Policy Policy Evaluation for Reinforcement Learning

- 2019

Off-policy policy evaluation (OPE) is the task of predicting the online performance of a policy using only pre-collected historical data (collected from an existing deployed policy or set of…

Online Off-policy Prediction

- 2020

This paper investigates the problem of online prediction learning, where prediction, action, and learning proceed continuously as the agent interacts with an unknown environment. The predictions made…

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2016

It is shown that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training.

#### References

Showing 1-10 of 98 references

Neuro-Dynamic Programming

- Computer Science, Economics
- Encyclopedia of Machine Learning
- 1996

From the Publisher:
This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of…

Convergence of Least Squares Temporal Difference Methods Under General Conditions

- Mathematics, Computer Science
- ICML
- 2010

This work establishes, for the discounted cost criterion, that off-policy LSTD(λ) converges almost surely under mild, minimal conditions, and suggests a modification to its practical implementation.
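The batch solution that LSTD computes can be sketched compactly. This is a generic LSTD(0) sketch with linear features, not the modified implementation the paper proposes; the two-state data and function names are hypothetical:

```python
import numpy as np

def lstd0(transitions, features, n_features, gamma=0.9):
    """LSTD(0): solve A @ theta = b, where
    A = sum phi(s) (phi(s) - gamma * phi(s'))^T and b = sum r * phi(s)."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, r, s_next in transitions:
        phi, phi_next = features(s), features(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

feats = lambda s: np.eye(2)[s]           # tabular (one-hot) features
data = [(0, 0.0, 1), (1, 1.0, 0)] * 50   # hypothetical two-state chain
theta = lstd0(data, feats, n_features=2)
```

With one-hot features the solution coincides with the exact values of the chain, V(0) = γ/(1 - γ²) and V(1) = 1/(1 - γ²).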

Introduction to Reinforcement Learning

- Computer Science
- 1998

In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.

Off-policy learning with eligibility traces: a survey

- Computer Science
- J. Mach. Learn. Res.
- 2014

A comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form is described; the comparison suggests that the most standard algorithms (on- and off-policy LSTD(λ), or TD(λ) if the feature space dimension is too large for a least-squares approach) perform the best.

Generalized polynomial approximations in Markovian decision processes

- Mathematics
- 1985

Fitting the value function in a Markovian decision process by a linear superposition of M basis functions reduces the problem dimensionality from the number of states down to M, with good…

Least Squares Policy Evaluation Algorithms with Linear Function Approximation

- Mathematics, Computer Science
- Discret. Event Dyn. Syst.
- 2003

A new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe, is proposed, and the convergence of LSTD(λ) is established with probability 1 for every λ ∈ [0, 1].

Model Selection in Reinforcement Learning

- Mathematics, Computer Science
- Machine Learning
- 2010

A complexity regularization-based model selection algorithm is proposed and its adaptivity is proved: the procedure is shown to perform almost as well as if the best parameter setting had been known ahead of time.

Intra-Option Learning about Temporally Abstract Actions

- Computer Science
- ICML
- 1998

This paper presents intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options, and sketches a convergence proof for intra-option value learning.

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

- Computer Science, Mathematics
- NIPS
- 2008

The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its $\ell_2$ norm; its expected update is shown to be in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
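The two coupled updates the abstract describes can be sketched as follows: a secondary weight vector w tracks the expected TD update E[δφ], and the primary weights θ descend the gradient of its squared norm. Step sizes and the two-state example are hypothetical illustration, not values from the paper:

```python
import numpy as np

def gtd0(sample_transition, features, n_features, gamma=0.9,
         alpha=0.1, beta=0.2, n_steps=100_000):
    """GTD(0) sketch: w estimates E[delta * phi]; theta follows a
    stochastic gradient of ||E[delta * phi]||^2."""
    theta = np.zeros(n_features)   # value-function weights
    w = np.zeros(n_features)       # secondary (expected-update) weights
    s = 0
    for _ in range(n_steps):
        s_next, r = sample_transition(s)
        phi, phi_next = features(s), features(s_next)
        delta = r + gamma * theta @ phi_next - theta @ phi
        w += beta * (delta * phi - w)                       # track E[delta*phi]
        theta += alpha * (phi - gamma * phi_next) * (phi @ w)  # gradient step
        s = s_next
    return theta

# Hypothetical on-policy two-state chain with tabular (one-hot) features.
def chain(s):
    return (1, 0.0) if s == 0 else (0, 1.0)

theta = gtd0(chain, lambda s: np.eye(2)[s], n_features=2)
```

On this on-policy example GTD(0) approaches the same fixed point as LSTD, at O(n) cost per step in the number of features.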

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

- Computer Science
- ICML
- 2011

PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way by learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning.