Corpus ID: 195874395

Incrementally Learning Functions of the Return

Brendan Bennett, Wesley Chung, Muhammad Hamad Zaheer, Vincent Liu
Temporal difference methods enable efficient, incremental estimation of value functions in reinforcement learning, and are of broader interest because they correspond to learning as observed in biological systems. Standard value functions represent the expected return, the sum of discounted rewards. While this formulation suffices for many purposes, it would often be useful to represent other functions of the return as well. Unfortunately, most such functions cannot…
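The incremental estimation of the expected return described above can be illustrated with a minimal TD(0) sketch; the toy chain, step sizes, and all variable names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal TD(0) sketch: incrementally estimate the expected return
# (the standard value function) on a toy 5-state deterministic chain.
# The chain and hyperparameters are illustrative assumptions.

n_states = 5
gamma = 0.9
alpha = 0.1

V = np.zeros(n_states + 1)  # V[n_states] is the terminal state, fixed at 0

for episode in range(2000):
    s = 0
    while s < n_states:
        r, s_next = 1.0, s + 1          # reward +1 on every transition
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# For this chain the true value is a finite geometric sum,
# e.g. V(0) = 1 + 0.9 + 0.9**2 + 0.9**3 + 0.9**4.
```

The update touches only the current state and its successor, which is what makes the method incremental: no full trajectory needs to be stored.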


Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods
A method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning is contributed, significantly simpler than prior methods that independently estimate the second moment of the λ-return.
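The direct variance estimation summarized above admits a compact sketch: the variance of the return can itself be learned by TD, using the squared TD error as a meta-reward and γ² as a meta-discount (the λ = 1 case). The toy chain, step sizes, and variable names below are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

# Hedged sketch of direct variance estimation via TD (lambda = 1 case):
# the variance estimate uses the squared TD error as its "reward" and
# gamma**2 as its discount. Toy chain and step sizes are assumptions.

n_states = 3
gamma = 0.9
alpha_v, alpha_var = 0.01, 0.05

rng = np.random.default_rng(0)

V = np.zeros(n_states + 1)      # value estimate (terminal entry fixed at 0)
Var = np.zeros(n_states + 1)    # variance-of-return estimate

for episode in range(50000):
    s = 0
    while s < n_states:
        r = rng.choice([-1.0, 1.0])                 # zero-mean +/-1 reward
        s_next = s + 1
        delta = r + gamma * V[s_next] - V[s]        # ordinary TD error
        V[s] += alpha_v * delta
        # Variance TD update: meta-reward delta**2, meta-discount gamma**2.
        Var[s] += alpha_var * (delta**2 + gamma**2 * Var[s_next] - Var[s])
        s = s_next

# Rewards are independent with unit variance, so the true variance of the
# return from state 0 is 1 + gamma**2 + gamma**4.
```

This works because, when the value estimate is accurate, the cross term between the TD error and the downstream return deviation vanishes, leaving a Bellman-like recursion for the variance itself.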
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
A novel objective function for optimizing λ as a function of state rather than time is contributed, which represents a concrete step towards black-box application of temporal-difference learning methods in real-world problems.
Learning the Variance of the Reward-To-Go
This paper proposes variants of both TD(0) and LSTD(λ) with linear function approximation, proves their convergence, and demonstrates their utility in an option pricing problem, showing a dramatic improvement in sample efficiency over standard Monte Carlo methods.
Improving Generalization for Temporal Difference Learning: The Successor Representation
  • P. Dayan
  • Computer Science
  • Neural Computation
  • 1993
This paper shows how TD machinery can be used to learn good function approximators or representations, and illustrates, using a navigation task, the appropriately distributed nature of the result.
Reinforcement Learning with Unsupervised Auxiliary Tasks
This paper significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, achieving a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Influence and variance of a Markov chain: application to adaptive discretization in optimal control
  • R. Munos, A. Moore
  • Mathematics
  • Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304)
  • 1999
This paper addresses the difficult problem of deciding where to refine the resolution of adaptive discretizations for solving continuous time-and-space deterministic optimal control problems.
Temporal Difference Models and Reward-Related Learning in the Human Brain
Regression analyses revealed that responses in ventral striatum and orbitofrontal cortex were significantly correlated with this prediction error signal, suggesting that, during appetitive conditioning, computations described by temporal difference learning are expressed in the human brain.
The variance of discounted Markov decision processes
Formulae are presented for the variance and higher moments of the present value of single-stage rewards in a finite Markov decision process. Similar formulae are exhibited for a semi-Markov decision process.
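The moment formulae referenced here follow a Bellman-like recursion; as a hedged reconstruction for a fixed policy with return $G = R + \gamma G'$, value $V(s) = \mathbb{E}[G \mid s]$, and second moment $M(s) = \mathbb{E}[G^2 \mid s]$:

```latex
M(s) = \mathbb{E}\!\left[\, R^2 + 2\gamma R\, V(s') + \gamma^2 M(s') \,\middle|\, s \right],
\qquad
\operatorname{Var}(G \mid s) = M(s) - V(s)^2 .
```

This follows from expanding $G^2 = R^2 + 2\gamma R G' + \gamma^2 G'^2$ and applying the Markov property to condition the downstream terms on $s'$.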