Corpus ID: 166228690

Faster and More Accurate Trace-based Policy Evaluation via Overall Target Error Meta-Optimization

Mingde Zhao, Ian Porada, Sitao Luan, Xiao-Wen Chang, Doina Precup
To improve the speed and accuracy of the trace-based policy evaluation method TD(λ), we derive and propose, under appropriate assumptions, an off-policy-compatible method for meta-learning state-based λ's online with efficient incremental updates. Furthermore, we prove that the derived bias-variance tradeoff minimization method, with slight adjustments, is equivalent to minimizing the overall target error in terms of state-based λ's. In experiments, the method shows significantly better performance…
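For a concrete reference point, the trace-based evaluation being tuned can be sketched as tabular TD(λ) with a state-dependent trace-decay parameter λ(s). This is a minimal illustrative fragment, not the paper's meta-learning update: the function name, the per-state `lam` array, and the fixed step size are all assumptions introduced here.

```python
import numpy as np

def td_lambda_state_based(episode, V, lam, alpha=0.1, gamma=0.99):
    """One episode of tabular TD(lambda) policy evaluation with a
    state-dependent trace decay lam[s] (illustrative sketch only).
    episode: list of (s, r, s_next, done) transitions."""
    e = np.zeros_like(V)                     # accumulating eligibility trace
    for s, r, s_next, done in episode:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e *= gamma * lam[s]                  # decay traces with this state's lambda
        e[s] += 1.0                          # accumulate trace for the visited state
        V += alpha * delta * e               # TD update weighted by the traces
    return V
```

The paper's contribution concerns how the `lam` array itself would be adapted online; that meta-optimization step is deliberately omitted here.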

References

Bias-Variance Error Bounds for Temporal Difference Updates

The authors' bounds give formal verification to the well-known intuition that TD methods are subject to a bias-variance tradeoff, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters.

Off-policy TD(λ) with a true online equivalence

This work generalizes Van Seijen and Sutton's equivalence result and uses the generalization to construct the first online algorithm exactly equivalent to an off-policy forward view; it empirically outperforms GTD(λ) (Maei, 2011), which was derived from the same objective but lacks the exact online equivalence.

A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning

A novel objective function is contributed for optimizing λ as a function of state rather than time, representing a concrete step towards black-box application of temporal-difference learning methods in real-world problems.

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Two new related algorithms with better convergence rates are introduced: the first, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (though still not as fast as conventional TD).

Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

A method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning is contributed; it is significantly simpler than prior methods that independently estimate the second moment of the λ-return.

Reinforcement Learning with Replacing Eligibility Traces

This paper introduces a new kind of eligibility trace, the replacing trace, analyzes it theoretically, and shows that it results in faster, more reliable learning than the conventional trace, significantly improving performance and reducing parameter sensitivity on the "Mountain-Car" task.
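The distinction between the conventional (accumulating) trace and the replacing trace is small enough to sketch directly. This is an illustrative fragment, not the paper's implementation; the function name and default parameter values are assumptions.

```python
import numpy as np

def update_trace(e, s, gamma=0.99, lam=0.9, replacing=True):
    """Decay all eligibility traces, then update the trace of the
    visited state s using either variant (illustrative sketch)."""
    e = gamma * lam * e       # decay every state's trace
    if replacing:
        e[s] = 1.0            # replacing trace: reset to 1 on each visit
    else:
        e[s] += 1.0           # accumulating trace: add 1 on each visit
    return e
```

With an accumulating trace, repeated visits to the same state can push its trace above 1, which is one source of the instability the replacing variant avoids.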

Adaptive Step-Size for Online Temporal Difference Learning

An adaptive upper bound on the step-size parameter is derived to guarantee that online TD learning with linear function approximation will not diverge; this effectively eliminates the need to tune the learning rate of temporal-difference learning with linear function approximation.

Analytical Mean Squared Error Curves for Temporal Difference Learning

It is shown that although the various temporal-difference algorithms are quite sensitive to the choice of step-size and eligibility-trace parameters, there are values of these parameters that make them similarly competent and generally good.

On the Worst-Case Analysis of Temporal-Difference Learning Algorithms

Lower bounds on the performance of any algorithm for this learning problem are proved, and a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement is given.

Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction

Results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience are presented.