Policy evaluation with temporal differences: a survey and comparison

Extended abstract of the article: Christoph Dann, Gerhard Neumann, Jan Peters (2014). Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15, 809–883.

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model.
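As a rough illustration of the setting these guarantees concern (a minimal sketch, not the paper's estimator; all names here are illustrative): tabular TD(0) policy evaluation, where a generative model lets the algorithm query a sampled reward and next state from any state.

```python
import numpy as np

def td0_evaluate(sample, n_states, gamma=0.9, n_steps=50_000, alpha=0.1):
    """Tabular TD(0) policy evaluation under a generative model.

    `sample(s)` is assumed to return (reward, next_state) drawn from the
    MDP under the fixed policy being evaluated.
    """
    V = np.zeros(n_states)
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        s = rng.integers(n_states)            # generative model: query any state
        r, s_next = sample(s)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
    return V

# Toy 2-state chain: state 0 -> 1 with reward 1; state 1 is absorbing, reward 0.
def sample(s):
    return (1.0, 1) if s == 0 else (0.0, 1)

V = td0_evaluate(sample, n_states=2)
# Fixed point: V(1) = 0, V(0) = 1 + 0.9 * V(1) = 1.
```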

Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

It is shown empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the stochasticity of the improvements, and that a suitable representation of the value function also stabilizes the solution to some degree.

An Adaptive Sampling Algorithm for Policy Evaluation

The empirical analysis shows that the algorithms converge to the neighbourhood of the fixed point of the projected Bellman equation faster than the respective state-of-the-art algorithms.

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.

Stochastic Variance Reduction Methods for Policy Evaluation

This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.
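The saddle-point reformulation this paper builds on can be sketched as follows: a minimal deterministic primal-dual batch gradient method on a synthetic instance, where the matrices A, b, C stand in for the usual TD quantities E[φ(φ − γφ′)ᵀ], E[rφ], E[φφᵀ]. The stochastic variance-reduced versions, which are the paper's contribution, are not shown.

```python
import numpy as np

def primal_dual_pe(A, b, C, alpha=0.1, beta=0.5, n_iters=500):
    """Primal-dual batch gradient for the saddle-point form of policy
    evaluation: min_w max_u  u^T (b - A w) - 0.5 u^T C u,
    whose primal solution w* minimizes the MSPBE."""
    w = np.zeros(A.shape[1])
    u = np.zeros(A.shape[0])
    for _ in range(n_iters):
        u = u + beta * (b - A @ w - C @ u)   # gradient ascent in the dual
        w = w + alpha * (A.T @ u)            # gradient descent in the primal
    return w

# Synthetic instance with known solution w* = A^{-1} b = [1, 1].
A = np.diag([2.0, 1.0])
b = np.array([2.0, 1.0])
C = np.eye(2)
w = primal_dual_pe(A, b, C)
```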

Investigating Practical Linear Temporal Difference Learning

This paper derives two new hybrid TD policy-evaluation algorithms that fill a gap in this collection of algorithms, performs an empirical comparison to determine which of these linear TD methods should be preferred in different situations, and makes concrete suggestions about practical use.

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

A novel model-free OPE method is developed by introducing future-dependent value functions that take future proxies as inputs; a new Bellman equation for future-dependent value functions is derived as conditional moment equations that use history proxies as instrumental variables.

Off-Policy Evaluation in Partially Observable Environments

A model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP, is formulated; it is shown how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs.

Empirical Analysis of Off-Policy Policy Evaluation for Reinforcement Learning

The first comprehensive empirical analysis of most of the recently proposed OPE methods is presented, offering a summarized set of guidelines for effectively using OPE in practice, as well as suggesting directions for future research to address current limitations.

Online Off-policy Prediction

A large empirical study of online off-policy prediction methods is presented in two challenging microworlds with fixed-basis feature representations, reporting each method's sensitivity to hyper-parameters, update variance, empirical convergence rate, and asymptotic performance, and providing new insights to enable practitioners to successfully extend these methods to challenging large-scale applications.

Neuro-Dynamic Programming

  • D. Bertsekas
  • Encyclopedia of Optimization
  • 2009
From the Publisher: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology.

Introduction to Reinforcement Learning

In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.

Convergence of Least Squares Temporal Difference Methods Under General Conditions

This work establishes for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions and suggests a modification in its practical implementation.

Off-policy learning with eligibility traces: a survey

A comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form is described; an empirical comparison suggests that the most standard algorithms perform the best: on- and off-policy LSTD(λ), and TD(λ) when the feature-space dimension is too large for a least-squares approach.

Least Squares Policy Evaluation Algorithms with Linear Function Approximation

A new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe, is proposed, and the convergence of LSTD(λ) with probability 1 is established for every λ ∈ [0, 1].
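As a minimal sketch of the LSTD(λ) computation these convergence results concern (assuming linear features and a single trajectory; all names are illustrative):

```python
import numpy as np

def lstd_lambda(transitions, phi, gamma=0.9, lam=0.5, eps=1e-8):
    """LSTD(lambda) on a single trajectory of (s, r, s') transitions.

    phi(s) maps a state to its feature vector.  Solves A w = b with
    A = sum_t z_t (phi_t - gamma * phi_{t+1})^T and b = sum_t z_t r_t,
    where z_t is the eligibility trace."""
    d = len(phi(transitions[0][0]))
    A = eps * np.eye(d)              # tiny ridge term to keep A invertible
    b = np.zeros(d)
    z = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        z = gamma * lam * z + f                  # eligibility trace update
        A += np.outer(z, f - gamma * f_next)
        b += z * r
    return np.linalg.solve(A, b)

# Deterministic 3-state chain 0 -> 1 -> 2 (absorbing), rewards 1, 1, 0.
# With one-hot features the solution is exact: V = [1.9, 1.0, 0.0].
phi = lambda s: np.eye(3)[s]
traj = [(0, 1.0, 1), (1, 1.0, 2)] + [(2, 0.0, 2)] * 20
w = lstd_lambda(traj, phi)
```

Because the chain is deterministic and the features are tabular, the true value function solves A w = b exactly for any λ ∈ [0, 1].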

Model Selection in Reinforcement Learning

A complexity regularization-based model selection algorithm is proposed and its adaptivity is proved: the procedure is shown to perform almost as well as if the best parameter setting was known ahead of time.

Intra-Option Learning about Temporally Abstract Actions

This paper presents intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options, and sketches a convergence proof for intra-option value learning.

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm; its expected update is proven to be in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
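The GTD update described above can be sketched as follows — a tabular toy sketch, assuming one-hot features and a deterministic chain; all names are illustrative:

```python
import numpy as np

def gtd_step(theta, u, f, f_next, r, gamma, alpha, beta):
    """One GTD update: u tracks the expected TD(0) update E[delta * phi];
    theta descends the gradient of its squared L2 norm."""
    delta = r + gamma * theta @ f_next - theta @ f
    u = u + beta * (delta * f - u)                      # secondary weights
    theta = theta + alpha * (f - gamma * f_next) * (f @ u)
    return theta, u

# Two-state chain (gamma = 0.5): 0 -> 1 with reward 1; 1 absorbing, reward 0.
# True values: V(0) = 1, V(1) = 0.  Tabular (one-hot) features.
phi = np.eye(2)
transitions = [(0, 1.0, 1), (1, 0.0, 1)]
theta, u = np.zeros(2), np.zeros(2)
for _ in range(10_000):
    for s, r, s_next in transitions:
        theta, u = gtd_step(theta, u, phi[s], phi[s_next], r,
                            gamma=0.5, alpha=0.1, beta=0.5)
```

Each step is O(n) in the number of features, which is the point of the method relative to LSTD's quadratic cost.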

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

A finite-sample, high-probability bound on the performance of the computed policy is found, which depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept, the approximation power of the function set, and the controllability properties of the MDP.

Kalman Temporal Differences

This contribution introduces a novel approximation scheme, the Kalman Temporal Differences (KTD) framework, that exhibits the following features: sample efficiency, non-linear approximation, non-stationarity handling, and uncertainty management; convergence is analyzed for special cases of both deterministic and stochastic transitions.