Corpus ID: 570214

Unifying Task Specification in Reinforcement Learning

@inproceedings{white2017unifying,
  title={Unifying Task Specification in Reinforcement Learning},
  author={Martha White},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2017}
}
Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the environment and the learning objective. This lack of modularity can complicate generalization of the task specification, as well as obfuscate connections between different task settings, such as episodic and continuing. In this work, we introduce the RL task formalism, which provides a unification through simple…


Continual Auxiliary Task Learning
This work investigates a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions, and develops an algorithm based on successor features that facilitates tracking under non-stationary rewards.
On the Expressivity of Markov Reward
This paper provides a set of polynomial-time algorithms that construct a Markov reward function allowing an agent to optimize each of three task types, and that correctly determine when no such reward function exists.
Goal-Space Planning with Subgoal Models
This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture, and shows that the GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
Time Limits in Reinforcement Learning
This paper provides a formal account for how time limits could effectively be handled in each of the two cases and explains why not doing so can cause state-aliasing and invalidation of experience replay, leading to suboptimal policies and training instability.
Steady State Analysis of Episodic Reinforcement Learning
This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's…
The Paradox of Choice: Using Attention in Hierarchical Reinforcement Learning
Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning…
Transition Based Discount Factor for Model Free Algorithms in Reinforcement Learning
This study introduces and analyses a transition-based discount factor in two model-free reinforcement learning algorithms: Q-learning and SARSA, and shows their convergence using the theory of stochastic approximation for finite state and action spaces.
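The transition-based discount described here replaces the constant γ with a function γ(s, a, s′) of the transition, the same construct the surveyed formalism uses to fold episodic termination into the discount itself. A minimal sketch of tabular Q-learning with such a discount on a toy chain MDP follows; the environment, constants, and function names are illustrative assumptions, not taken from the cited paper:

```python
import random

# Toy chain MDP: states 0..3, state 3 is terminal; action 0 moves left, 1 moves right.
N_STATES, ACTIONS = 4, (0, 1)

def step(s, a):
    """Deterministic transition; reward 1 only on entering the terminal state."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

def gamma(s, a, s2):
    """Transition-based discount: 0 on the transition into the terminal state
    (cutting the return, as in an episodic task), 0.9 everywhere else."""
    return 0.0 if s2 == N_STATES - 1 else 0.9

def q_learning(episodes=500, alpha=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # epsilon-greedy action selection
            a = rng.choice(ACTIONS) if rng.random() < eps else max(ACTIONS, key=lambda x: Q[s][x])
            s2, r = step(s, a)
            # Standard Q-learning update, except the constant discount is
            # replaced by the transition-dependent gamma(s, a, s2).
            target = r + gamma(s, a, s2) * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning()
```

With γ equal to zero on the terminating transition, termination needs no special case in the update rule: the discount alone cuts off the return, which is the sense in which episodic and continuing tasks can share one specification.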
Generalizing Value Estimation over Timescale
General value functions (GVFs) are an approach to representing models of an agent’s world as a collection of predictive questions. A GVF is defined by: a policy, a prediction target, and a timescale.
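Following the definition above, a GVF question can be answered with an ordinary temporal-difference learner in which the reward is replaced by the cumulant and the discount encodes the timescale. A minimal tabular TD(0) sketch, where the helper name, toy transitions, and constants are illustrative assumptions rather than anything from the cited work:

```python
def td0_gvf(transitions, n_states, gamma=0.9, alpha=0.1, sweeps=2000):
    """Tabular TD(0) for a GVF prediction.

    transitions: list of (state, cumulant, next_state) samples gathered
    under the GVF's target policy. gamma encodes the timescale.
    Returns the learned value estimate for each state."""
    v = [0.0] * n_states
    for _ in range(sweeps):
        for s, c, s2 in transitions:
            # TD(0): move v[s] toward cumulant + discounted next estimate.
            v[s] += alpha * (c + gamma * v[s2] - v[s])
    return v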
Gamma-Nets: Generalizing Value Estimation over Timescale
Γ-nets provide a method for compactly making predictions at many timescales without requiring a priori knowledge of the task, making them a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.
Representation and General Value Functions
This dissertation proposes using internally generated signals as cumulants for introspective GVFs and argues that such predictions can enhance an agent’s state representation, and introduces Γ-nets, which enable a single GVF estimator to make predictions for any fixed timescale within the training bounds, improving the tractability of learning and representing vast numbers of predictions.


TD Models: Modeling the World at a Mixture of Time Scales
Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction
Results using Horde on a multi-sensored mobile robot to successfully learn goal-oriented behaviors and long-term predictions from off-policy experience are presented.
Multi-timescale nexting in a reinforcement learning robot
This paper presents results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds, and extends nexting beyond simple timescales by letting the discount rate be a function of the state.
An object-oriented representation for efficient reinforcement learning
Object-Oriented MDPs (OO-MDPs) are introduced, a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities; a polynomial bound on its sample complexity is proved.
A new Q(λ) with interim forward view and Monte Carlo equivalence
A new version of Q(λ) is introduced that achieves exactly this Monte Carlo equivalence without significantly increased algorithmic complexity, using a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ), but extended to apply at every time step rather than only at the end of episodes.
The Dependence of Effective Planning Horizon on Model Accuracy
It is shown formally that the planning horizon is a complexity control parameter for the class of policies to be learned and has an intuitive, monotonic relationship with a simple counting measure of complexity, and that a similar relationship can be observed empirically with a more general and data-dependent Rademacher complexity measure.
GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces
The GQ(λ) algorithm is introduced, which can be seen as an extension of earlier gradient temporal-difference work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions.
Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
A generalization of the recently introduced emphatic temporal differences (ETD) algorithm is proposed that encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases; the introduced parameter β controls the decay rate of an importance-sampling term.
True online TD(λ)
This paper introduces a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it, and uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces.