• Corpus ID: 195346174

# Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

@article{Sherstan2018DirectlyET,
title={Directly Estimating the Variance of the $\lambda$-Return Using Temporal-Difference Methods},
author={Craig Sherstan and Brendan Bennett and K. Young and Dylan R. Ashley and Adam White and Martha White and Richard S. Sutton},
journal={ArXiv},
year={2018},
volume={abs/1801.08287}
}
• Published 25 January 2018
• Computer Science
• ArXiv
This paper investigates estimating the variance of a temporal-difference learning agent’s update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the…
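The direct approach described in the abstract can be illustrated with a short tabular sketch. The helper below, `td_variance_sketch`, is hypothetical and not the paper's code; it assumes episodes given as `(state, reward, next_state, done)` tuples. The idea is that the variance of the λ-return is itself learned with a TD-style update, using the squared TD error as a meta-reward and (γλ)² as the meta-discount:

```python
import numpy as np

def td_variance_sketch(episodes, n_states, gamma=0.99, lam=0.9,
                       alpha_v=0.1, alpha_var=0.1):
    """Tabular sketch: learn the value V and the variance of the return.

    The variance estimate is updated with a TD rule whose reward is the
    squared TD error and whose discount is (gamma * lam) ** 2, following
    the direct-estimation idea described above.
    """
    V = np.zeros(n_states)       # value estimates
    Var = np.zeros(n_states)     # variance-of-return estimates
    gamma_bar = (gamma * lam) ** 2
    for episode in episodes:     # episode: list of (s, r, s_next, done)
        for s, r, s_next, done in episode:
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]        # ordinary TD error
            V[s] += alpha_v * delta
            var_next = 0.0 if done else Var[s_next]
            delta_bar = delta ** 2 + gamma_bar * var_next - Var[s]
            Var[s] += alpha_var * delta_bar          # variance TD update
    return V, Var
```

On a one-state problem with a ±1 terminal reward, the value estimate settles near 0 and the variance estimate near 1, the variance of the return.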

## Citations

A Natural Actor-Critic Algorithm with Downside Risk Constraints
• Computer Science
ArXiv
• 2020
A new Bellman equation is introduced that upper-bounds the lower partial moment, circumventing its non-linearity; a variance decomposition provides intuition into the stability of the algorithm and allows sample-efficient, online estimation of partial moments.
Incrementally Learning Functions of the Return
• Economics
ArXiv
• 2019
This work proposes a means of estimating functions of the return using its moments, which can be learned online using a modified TD algorithm and used as part of a Taylor expansion to approximate analytic functions of the return.
META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning
This work proposes a meta-learning method for adjusting the eligibility trace parameter in a state-dependent manner, and proves that the proposed method improves the overall quality of the update targets by minimizing the overall target error.
META-Learning State-based Eligibility Traces for More Sample-Efficient Policy Evaluation
• Computer Science
AAMAS
• 2020
This work proposes a meta-learning method for adjusting the eligibility trace parameter in a state-dependent manner, and proves that the proposed method improves the overall quality of the update targets by minimizing the overall target error.
Safe option-critic: learning safety in the option-critic architecture
• Computer Science
The Knowledge Engineering Review
• 2021
This work considers a behaviour safe if it avoids regions of state space with high uncertainty in the outcomes of actions, and proposes an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency.
Novelty-Guided Reinforcement Learning via Encoded Behaviors
• Computer Science, Psychology
2020 International Joint Conference on Neural Networks (IJCNN)
• 2020
A function-approximation paradigm that instead learns sparse representations of agent behaviors using auto-encoders, which are later used to assign novelty scores to policies; the results suggest that this form of novelty-guided exploration is a viable alternative to classic novelty search methods.
Faster and More Accurate Trace-based Policy Evaluation via Overall Target Error Meta-Optimization
• Computer Science
• 2019
It is proved that the derived bias-variance-tradeoff minimization method, with slight adjustments, is equivalent to minimizing the overall target error in terms of state-based λ's.
Evolutionary Reinforcement Learning
• Computer Science
NIPS 2018
• 2018
Evolutionary Reinforcement Learning (ERL), a hybrid algorithm that leverages the population of an EA to provide diversified data to train an RL agent, and periodically reinserts the RL agent into the EA population to inject gradient information into the EA.
Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation
• Computer Science
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
• 2018
It is shown that using the Successor Representation can improve sample efficiency and learning speed of GVFs in a continual learning setting where new predictions are incrementally added and learned over time.
Prediction, Knowledge, and Explainability: Examining the Use of General Value Functions in Machine Knowledge
• Computer Science
Frontiers in Artificial Intelligence
• 2022
It is proposed that prior to explaining its decisions to others, a self-supervised agent must be able to introspectively explain decisions to itself, and it is demonstrated that by making their subjective explanations public, predictive-knowledge agents can improve the clarity of their operation in collaborative tasks.

## References

Showing 10 of 17 references.
Learning the Variance of the Reward-To-Go
• Computer Science
J. Mach. Learn. Res.
• 2016
This paper proposes variants of both TD(0) and LSTD(λ) with linear function approximation, proves their convergence, demonstrates their utility in an option-pricing problem, and shows a dramatic improvement in sample efficiency over standard Monte Carlo methods.
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
• Computer Science
AAMAS
• 2016
A novel objective function for optimizing $\lambda$ as a function of state rather than time is contributed, which represents a concrete step towards black-box application of temporal-difference learning methods in real world problems.
TD algorithm for the variance of return and mean-variance reinforcement learning
• Computer Science
• 2001
A TD algorithm for estimating the variance of the return in MDP (Markov decision process) environments is presented, along with a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control.
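The indirect, second-moment route taken by this line of work can be sketched in a few lines. The helper below, `second_moment_td`, is hypothetical and assumes tabular states and `(state, reward, next_state, done)` transitions: it learns the second moment M of the return with a Bellman-style target r² + 2γrV(s') + γ²M(s'), then recovers the variance as Var(s) = M(s) − V(s)²:

```python
import numpy as np

def second_moment_td(transitions, n_states, alpha=0.05, gamma=0.9):
    """Sketch of the indirect route: learn the value V and second
    moment M of the return with TD-style updates, then recover the
    variance as Var(s) = M(s) - V(s)**2."""
    V = np.zeros(n_states)
    M = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V[s_next]
        m_next = 0.0 if done else M[s_next]
        V[s] += alpha * (r + gamma * v_next - V[s])
        # second-moment target follows from G = r + gamma * G':
        # E[G^2] = E[r^2 + 2*gamma*r*G' + gamma^2 * G'^2]
        M[s] += alpha * (r**2 + 2 * gamma * r * v_next
                         + gamma**2 * m_next - M[s])
    return V, M - V**2
```

Because the variance is obtained as a difference of two learned estimates, it can be noisy or even briefly negative, which is the motivation the main paper gives for estimating the variance directly instead.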
Learning to Predict by the Methods of Temporal Differences
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior – and proves their convergence and optimality for special cases and relation to supervised-learning methods.
Fast gradient-descent methods for temporal-difference learning with linear function approximation
• Computer Science
ICML '09
• 2009
Two new related algorithms with better convergence rates are introduced: the first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD).
Actor-Critic Algorithms for Risk-Sensitive MDPs
• Computer Science
NIPS
• 2013
This paper considers both discounted and average-reward Markov decision processes and devises actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction, establishing the convergence of the algorithms to locally risk-sensitive optimal policies.
On Convergence of Emphatic Temporal-Difference Learning
This paper presents the first convergence proofs for two emphatic algorithms, ETD and ELSTD, and proves, under general off-policy conditions, the convergence in $L^1$ for ELSTD iterates, and the almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory.
Policy Gradients with Variance Related Risk Criteria
• Computer Science
ICML
• 2012
A framework for local policy-gradient-style reinforcement learning algorithms for variance-related risk criteria, covering objectives that involve both the expected cost and the variance of the cost.
Unifying Task Specification in Reinforcement Learning
This work introduces the RL task formalism, which provides a unification through simple constructs, including a generalization to transition-based discounting; it extends standard learning constructs, such as Bellman operators, and some seminal theoretical results, including approximation-error bounds.