# Unifying Task Specification in Reinforcement Learning

```bibtex
@article{White2017UnifyingTS,
  title   = {Unifying Task Specification in Reinforcement Learning},
  author  = {Martha White},
  journal = {ArXiv},
  year    = {2017},
  volume  = {abs/1609.01995}
}
```

Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the environment and the learning objective. This lack of modularity can complicate generalization of the task specification, as well as obfuscate connections between different task settings, such as episodic and continuing. In this work, we introduce the RL task formalism, which provides a unification through simple…
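The unification the abstract alludes to can be sketched with transition-based discounting: replacing a constant discount with a function of the transition makes episodic termination the special case where the discount is zero on transitions into terminal states. The following is a minimal illustration under that reading, not the paper's exact formalism; `td0_update` and `gamma_fn` are hypothetical names.

```python
def td0_update(V, s, r, s_next, gamma_fn, alpha=0.1):
    """Tabular TD(0) where the discount depends on the transition.

    With gamma_fn returning 0 on entry to a terminal state, the episodic
    and continuing settings share this single update rule.
    """
    g = gamma_fn(s, s_next)
    td_error = r + g * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# Episodic behavior falls out of the transition-based discount:
V = {}
gamma_fn = lambda s, s_next: 0.0 if s_next == "terminal" else 0.99
V = td0_update(V, "s0", 1.0, "terminal", gamma_fn, alpha=0.5)  # V["s0"] == 0.5
```

The same update handles a continuing task by letting `gamma_fn` return a constant below 1 everywhere.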

## 64 Citations

Continual Auxiliary Task Learning

- Computer Science, NeurIPS
- 2021

This work investigates a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions, and develops an algorithm based on successor features that facilitates tracking under non-stationary rewards.

On the Expressivity of Markov Reward

- Computer Science, Psychology, NeurIPS
- 2021

This paper provides a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists.

Goal-Space Planning with Subgoal Models

- Computer Science, ArXiv
- 2022

This paper investigates a new approach to model-based reinforcement learning using background planning, mixing (approximate) dynamic programming updates and model-free updates in a manner similar to the Dyna architecture, and shows that the GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.

Time Limits in Reinforcement Learning

- Computer Science, ICML
- 2018

This paper provides a formal account for how time limits could effectively be handled in each of the two cases and explains why not doing so can cause state-aliasing and invalidation of experience replay, leading to suboptimal policies and training instability.

Steady State Analysis of Episodic Reinforcement Learning

- Computer Science, NeurIPS
- 2020

This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's…

The Paradox of Choice: Using Attention in Hierarchical Reinforcement Learning

- Economics, ArXiv
- 2022

Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning…

Transition Based Discount Factor for Model Free Algorithms in Reinforcement Learning

- Computer Science, Symmetry
- 2021

This study introduces and analyses a transition-based discount factor in two model-free reinforcement learning algorithms: Q-learning and SARSA, and shows their convergence using the theory of stochastic approximation for finite state and action spaces.

Generalizing Value Estimation over Timescale

- Psychology
- 2018

General value functions (GVFs) are an approach to representing models of an agent’s world as a collection of predictive questions. A GVF is defined by: a policy, a prediction target, and a timescale.…

Gamma-Nets: Generalizing Value Estimation over Timescale

- Computer Science, AAAI
- 2020

Γ-nets provide a method for compactly making predictions at many timescales without requiring a priori knowledge of the task, making them a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.

Representation and General Value Functions

- Computer Science
- 2020

This dissertation proposes using internally generated signals as cumulants for introspective GVFs and argues that such predictions can enhance an agent’s state representation, and introduces Γ-nets, which enable a single GVF estimator to make predictions for any fixed timescale within the training bounds, improving the tractability of learning and representing vast numbers of predictions.

## References

Showing 10 of 31 references.

Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning

- Computer Science, Artif. Intell.
- 1999

TD Models: Modeling the World at a Mixture of Time Scales

- Computer Science, ICML
- 1995

Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction

- Computer Science, AAMAS
- 2011

Results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience are presented.

Multi-timescale nexting in a reinforcement learning robot

- Computer Science, Adapt. Behav.
- 2014

This paper presents results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds, and extends nexting beyond simple timescales by letting the discount rate be a function of the state.
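The relationship between a discount and a prediction timescale can be sketched concretely: with update interval `dt`, a constant discount γ corresponds to an effective horizon of roughly `dt / (1 - γ)` seconds, and each timescale gets its own TD learner. This is a minimal illustration, not the paper's implementation; `timescale_to_gamma` and `gvf_td0` are hypothetical names.

```python
def timescale_to_gamma(timescale_s, dt=0.1):
    """Discount whose effective horizon dt / (1 - gamma) matches the
    given timescale in seconds (hypothetical helper)."""
    return 1.0 - dt / timescale_s

def gvf_td0(w, x, x_next, cumulant, gamma_next, alpha=0.01):
    """One linear TD(0) update for a single prediction; gamma_next may
    depend on the next state, as in state-dependent discounting."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    delta = cumulant + gamma_next * dot(w, x_next) - dot(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

# Predictions at 0.1 s, 1 s, and 8 s, as in the robot experiments:
gammas = [timescale_to_gamma(t) for t in (0.1, 1.0, 8.0)]  # 0.0, 0.9, 0.9875
```

Running one `gvf_td0` learner per entry of `gammas` on the same feature stream yields the multi-timescale predictions the abstract describes.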

An object-oriented representation for efficient reinforcement learning

- Computer Science, ICML '08
- 2008

Object-Oriented MDPs (OO-MDPs) are introduced, a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities and a polynomial bound on its sample complexity is proved.

A new Q(λ) with interim forward view and Monte Carlo equivalence

- Computer Science, ICML
- 2014

This paper introduces a new version of Q(λ) that achieves Monte Carlo equivalence without significantly increased algorithmic complexity, along with a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ), extended to apply at every time step rather than only at the end of episodes.

The Dependence of Effective Planning Horizon on Model Accuracy

- Economics, AAMAS
- 2015

It is shown formally that the planning horizon is a complexity control parameter for the class of policies to be learned and has an intuitive, monotonic relationship with a simple counting measure of complexity, and that a similar relationship can be observed empirically with a more general and data-dependent Rademacher complexity measure.

GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces

- Computer Science
- 2010

The GQ(λ) algorithm is introduced which can be seen as extension of that work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions.

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

- Computer Science, Mathematics, AAAI
- 2016

A generalization of the recently introduced emphatic temporal differences (ETD) algorithm, which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases, is proposed, where the introduced parameter β controls the decay rate of an importance-sampling term.

True online TD(λ)

- Computer Science, ICML
- 2014

This paper introduces a new forward view that takes into account the possibility of changing estimates, and a new variant of TD(λ) that exactly achieves it, using a new form of eligibility trace similar to, but different from, conventional accumulating and replacing traces.
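The true online TD(λ) update with its modified ("dutch") trace can be sketched for linear function approximation as follows. This is an illustrative sketch of the published update equations; the function and variable names are hypothetical.

```python
def true_online_td_step(w, e, phi, phi_next, reward, v_old,
                        alpha=0.1, gamma=0.9, lam=0.8):
    """One step of true online TD(lambda) with a dutch trace
    (linear function approximation; w, e, phi are equal-length lists)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    v = dot(w, phi)
    v_next = dot(w, phi_next)
    delta = reward + gamma * v_next - v
    # Dutch trace: differs from the accumulating trace by the
    # -alpha*gamma*lam*(e.phi)*phi correction term.
    e_phi = dot(e, phi)
    e = [gamma * lam * ei + pi - alpha * gamma * lam * e_phi * pi
         for ei, pi in zip(e, phi)]
    # Weight update includes the (v - v_old) correction that makes the
    # online algorithm exactly match its interim forward view.
    w = [wi + alpha * (delta + v - v_old) * ei - alpha * (v - v_old) * pi
         for wi, ei, pi in zip(w, e, phi)]
    return w, e, v_next  # v_next is passed back as v_old next step
```

With λ = 0 and v_old tracked correctly, the update reduces to ordinary linear TD(0), which is a quick sanity check on an implementation.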