In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. …
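To make the emphasis idea concrete, below is a minimal sketch of a single emphatic TD(λ)-style update for linear value estimation, under the assumption of constant discount, bootstrapping, and step-size parameters; the function name and the rho/rho_prev/interest argument names are illustrative, not taken from the paper.

```python
import numpy as np

def emphatic_td_step(theta, e, F, phi, phi_next, reward,
                     rho, rho_prev, interest,
                     gamma=0.99, lam=0.9, alpha=0.01):
    """One emphatic TD(lambda)-style update with linear function approximation.

    theta: weight vector, e: eligibility trace, F: scalar followon trace,
    rho / rho_prev: importance-sampling ratios for the current / previous step,
    interest: i(S_t) >= 0, how much we care about accuracy in this state.
    """
    # TD error of the current linear value estimate
    delta = reward + gamma * theta.dot(phi_next) - theta.dot(phi)
    # Followon trace: discounted interest carried forward along the trajectory
    F = rho_prev * gamma * F + interest
    # Emphasis: a mix of immediate interest and the accumulated followon trace
    M = lam * interest + (1.0 - lam) * F
    # Emphasis-weighted, importance-corrected eligibility trace
    e = rho * (gamma * lam * e + M * phi)
    # Emphasized TD update
    theta = theta + alpha * delta * e
    return theta, e, F

# illustrative usage with one-hot features
n = 4
theta, e, F = np.zeros(n), np.zeros(n), 0.0
phi, phi_next = np.eye(n)[0], np.eye(n)[1]
theta, e, F = emphatic_td_step(theta, e, F, phi, phi_next, reward=1.0,
                               rho=1.0, rho_prev=1.0, interest=1.0)
```

The emphasis M is what reweights each update and thereby changes the effective state distribution that the algorithm learns under.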
Incremental learning algorithms based on gradient descent are effective and popular in online supervised learning, reinforcement learning, signal processing, and many other application areas. An oft-noted drawback of these algorithms is that they include a step-size parameter that needs to be tuned for best performance, which may require manual intervention. …
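As a rough illustration of per-feature step-size adaptation in this spirit, here is a sketch of an IDBD-style rule for supervised linear prediction (a precursor of the tuning-free method the abstract describes, not the paper's exact algorithm); all names and the meta_rate default are illustrative.

```python
import numpy as np

def idbd_step(w, h, beta, x, y, meta_rate=0.01):
    """One IDBD-style update: each weight w[i] has its own step size exp(beta[i]),
    adapted online from the correlation between current and past errors (tracked in h)."""
    delta = y - w.dot(x)                      # prediction error on this example
    beta = beta + meta_rate * delta * x * h   # adapt the log step sizes
    alpha = np.exp(beta)                      # per-feature step sizes
    w = w + alpha * delta * x                 # LMS-style update with per-feature rates
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, h, beta
```

Note the remaining meta_rate parameter: reducing sensitivity to this kind of meta parameter is part of what "tuning-free" adaptation aims for.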
Learning representations from data is one of the fundamental problems of artificial intelligence and machine learning. Many different approaches exist for learning representations, but what constitutes a good representation is not yet well understood. In this work, we view the problem of representation learning as one of learning features (e.g., hidden units of a neural network) …
Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. …
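For reference, the two estimators being contrasted can be written in a few lines for the simple case of per-trajectory returns and ratios (no function approximation); the function names are illustrative.

```python
import numpy as np

def ordinary_is(returns, ratios):
    """Ordinary importance sampling: unbiased, but its variance grows with the ratios."""
    returns, ratios = np.asarray(returns, float), np.asarray(ratios, float)
    return float(np.mean(ratios * returns))

def weighted_is(returns, ratios):
    """Weighted importance sampling: normalizes by the sum of the ratios,
    trading a small bias for usually much lower variance."""
    returns, ratios = np.asarray(returns, float), np.asarray(ratios, float)
    return float(np.sum(ratios * returns) / np.sum(ratios))

# ratios are products of per-step pi(a|s) / mu(a|s) along each trajectory
print(ordinary_is([1.0, 0.0, 2.0], [0.5, 3.0, 0.1]))
print(weighted_is([1.0, 0.0, 2.0], [0.5, 3.0, 0.1]))
```

The per-sample normalization in weighted_is is the part that does not extend directly to parametric function approximation, which is the gap the abstract refers to.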
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training. …
Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). …
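The role of λ is easiest to see in a single tabular prediction update with accumulating traces; this sketch is illustrative (TD(λ) rather than Q(λ)) and assumes integer state indices and fixed γ and λ.

```python
import numpy as np

def td_lambda_step(V, e, s, r, s_next, terminal,
                   alpha=0.1, gamma=0.99, lam=0.9):
    """One tabular TD(lambda) update with accumulating eligibility traces.

    V and e are numpy arrays indexed by state. lam = 0 gives pure one-step
    bootstrapping (TD(0)); lam = 1 spreads each TD error over all previously
    visited states, approaching a Monte Carlo update.
    """
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    e[s] += 1.0              # mark the visited state as eligible
    V += alpha * delta * e   # all recently visited states share in the correction
    e *= gamma * lam         # traces decay at rate gamma * lambda
    return V, e

# illustrative usage on a toy 3-state chain transition (state 0 -> state 1, reward 1.0)
V, e = np.zeros(3), np.zeros(3)
V, e = td_lambda_step(V, e, s=0, r=1.0, s_next=1, terminal=False)
```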
Importance sampling is an essential component of model-free off-policy learning algorithms. Weighted importance sampling (WIS) is generally considered superior to ordinary importance sampling but, when combined with function approximation, it has hitherto required computational complexity that is O(n²) or more in the number of features. In this paper we …
The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively. …
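Below is a minimal sketch of a true online TD(λ)-style step with a dutch-style trace, assuming linear function approximation and constant α, γ, and λ; names are illustrative and the per-step parameter schedules used in the papers are omitted.

```python
import numpy as np

def true_online_td_step(theta, e, V_old, phi, phi_next, reward,
                        alpha=0.01, gamma=0.99, lam=0.9):
    """One true online TD(lambda)-style update.

    V_old carries the previous weights' estimate of the current state's value,
    which is what the extra correction terms compare against.
    """
    V = theta.dot(phi)
    V_next = theta.dot(phi_next)
    delta = reward + gamma * V_next - V
    # Dutch-style trace: an accumulating trace with an extra correction term
    e = gamma * lam * e + phi - alpha * gamma * lam * e.dot(phi) * phi
    # Weight update; the (V - V_old) terms keep the online updates consistent
    # with the forward view
    theta = theta + alpha * (delta + V - V_old) * e - alpha * (V - V_old) * phi
    return theta, e, V_next   # V_next becomes V_old on the following step

# illustrative usage with one-hot features
n = 4
theta, e, V_old = np.zeros(n), np.zeros(n), 0.0
phi, phi_next = np.eye(n)[0], np.eye(n)[1]
theta, e, V_old = true_online_td_step(theta, e, V_old, phi, phi_next, reward=1.0)
```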