Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their *gradient temporal difference* (GTD) algorithm converges reliably, it…

A new family of gradient temporal-difference learning algorithms has recently been introduced by Sutton, Maei and others in which function approximation is much more straightforward. In this paper, we introduce the GQ(λ) algorithm, which can be seen as an extension of that work to a more general setting including eligibility traces and off-policy learning of…

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come…

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy…

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning, and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling,…

The brain is easily able to process and categorize complex time-varying signals. For example, the two sentences, "It is cold in London this time of year" and "It is hot in London this time of year," have different meanings, even though the words hot and cold appear several seconds before the ends of the two sentences. Any network that can tell these…

Demand response (DR) for residential and small commercial buildings is estimated to account for as much as 65% of the total energy savings potential of DR, and previous work shows that a fully automated Energy Management System (EMS) is a necessary prerequisite to DR in these areas. In this paper, we propose a novel EMS formulation for DR problems in these…

We develop an off-policy actor–critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view toward its use in mobile health.

- Hamid Reza Maei, Richard S Sutton, Geoffrey J Gordon, Eric Wiewiora, Istvan Szita, Ashique Mahmood +4 others
- 2011

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these…
