Corpus ID: 9594186

# Policy evaluation with temporal differences: a survey and comparison

@article{Dann2014PolicyEW,
title={Policy evaluation with temporal differences: a survey and comparison},
author={Christoph Dann and Gerhard Neumann and Jan Peters},
journal={J. Mach. Learn. Res.},
year={2014},
volume={15},
pages={809--883}
}
• Published 2014
• Mathematics, Computer Science
• J. Mach. Learn. Res.
Extended abstract of the article: Christoph Dann, Gerhard Neumann, Jan Peters (2014). Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15, 809-883.
156 Citations
Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis
• Mathematics, Computer Science
• ArXiv
• 2020
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both…
Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning
• Computer Science, Mathematics
• ArXiv
• 2016
It is shown empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity and that a suitable representation of the value function also stabilizes the solution to some degree.
An Adaptive Sampling Algorithm for Policy Evaluation
• Computer Science
• 2019 Fifth Indian Control Conference (ICC)
• 2019
The empirical analysis shows that the algorithms converge to the neighbourhood of the fixed point of the projected Bellman equation faster than the respective state-of-the-art algorithms.
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
• Computer Science, Mathematics
• ArXiv
• 2019
This work presents the first comprehensive empirical analysis of a broad suite of OPE methods, offers a summarized set of guidelines for effectively using OPE in practice, and suggests directions for future research.
Stochastic Variance Reduction Methods for Policy Evaluation
• Computer Science, Mathematics
• ICML
• 2017
This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.
Investigating Practical Linear Temporal Difference Learning
• Computer Science, Mathematics
• AAMAS
• 2016
This paper derives two new hybrid TD policy-evaluation algorithms, which fill a gap in this collection of algorithms, performs an empirical comparison to elicit which of these new linear TD methods should be preferred in different situations, and makes concrete suggestions about practical use.
Off-Policy Evaluation in Partially Observable Environments
• Computer Science, Engineering
• AAAI
• 2020
A model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP, is formulated, and it is shown how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs.
Empirical Analysis of Off-Policy Policy Evaluation for Reinforcement Learning
Off-policy policy evaluation (OPE) is the task of predicting the online performance of a policy using only pre-collected historical data (collected from an existing deployed policy or set of…
Online Off-policy Prediction
This paper investigates the problem of online prediction learning, where prediction, action, and learning proceed continuously as the agent interacts with an unknown environment. The predictions made…
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
• Mathematics, Computer Science
• J. Mach. Learn. Res.
• 2016
It is shown that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training.

#### References

SHOWING 1-10 OF 98 REFERENCES
Neuro-Dynamic Programming
• Computer Science, Economics
• Encyclopedia of Machine Learning
• 1996
From the Publisher: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of…
Convergence of Least Squares Temporal Difference Methods Under General Conditions
This work establishes for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions and suggests a modification in its practical implementation.
Introduction to Reinforcement Learning
• Computer Science
• 1998
In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning.
Off-policy learning with eligibility traces: a survey
• Computer Science
• J. Mach. Learn. Res.
• 2014
A comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form is described, and simulations suggest that the most standard algorithms, on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) if the feature space dimension is too large for a least-squares approach, perform the best.
Generalized polynomial approximations in Markovian decision processes
• Mathematics
• 1985
Fitting the value function in a Markovian decision process by a linear superposition of M basis functions reduces the problem dimensionality from the number of states down to M, with good…
Least Squares Policy Evaluation Algorithms with Linear Function Approximation
• Mathematics, Computer Science
• Discret. Event Dyn. Syst.
• 2003
A new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe, is proposed, and the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1] is shown.
Model Selection in Reinforcement Learning
• Mathematics, Computer Science
• Machine Learning
• 2010
A complexity regularization-based model selection algorithm is proposed and its adaptivity is proved: the procedure is shown to perform almost as well as if the best parameter setting was known ahead of time.
Intra-Option Learning about Temporally Abstract Actions
• Computer Science
• ICML
• 1998
This paper presents intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options, and sketches a convergence proof for intra-option value learning.
A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
• Computer Science, Mathematics
• NIPS
• 2008
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. It is proved that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.
PILCO: A Model-Based and Data-Efficient Approach to Policy Search
• Computer Science
• ICML
• 2011
PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way by learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning.