How Should an Agent Practice?

Janarthanan Rajendran, Richard L. Lewis, Vivek Veeriah, Honglak Lee, Satinder Singh
We present a method for learning intrinsic reward functions that drive an agent's learning during periods of practice in which extrinsic task rewards are not available. During practice, the environment may differ from the one available for training and evaluation with extrinsic rewards. We refer to this setup of alternating periods of practice and objective evaluation as practice-match, by analogy to the regimes of skill acquisition common for humans in sports and games. The agent must…


Discovery of Options via Meta-Learned Subgoals

A novel meta-gradient approach for discovering useful options in multi-task RL environments based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered-options and primitive actions.

Pairwise Weights for Temporal Credit Assignment

This empirical paper explores heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two.
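The pairwise weighting described above can be made concrete with a small sketch. A weight w(s_t, s_k, k − t) replaces the usual time-only discount γ^(k−t) when assigning the reward at time k to the action at time t; the specific heuristic below (attenuating credit between distant states) is a hypothetical illustration, not a weighting from the paper.

```python
# Pairwise credit-assignment weights: the action at time t receives the
# reward at time k scaled by w(s_t, s_k, k - t), a generalization of the
# time-only discount gamma**(k - t). The weighting here is hypothetical.

GAMMA = 0.9

def pairwise_weight(state_t, state_k, lag):
    # Hypothetical heuristic: ordinary time discounting, further
    # attenuated when the two states are far apart.
    distance = abs(state_k - state_t)
    return (GAMMA ** lag) / (1.0 + distance)

def weighted_return(states, rewards, t):
    # Total credit assigned to the action taken at time t.
    return sum(
        pairwise_weight(states[t], states[k], k - t) * rewards[k]
        for k in range(t, len(rewards))
    )

states = [0.0, 1.0, 5.0]
rewards = [0.0, 1.0, 1.0]
g0 = weighted_return(states, rewards, 0)  # 0.9/2 + 0.81/6 = 0.585
```

Setting the weight to `GAMMA ** lag` alone recovers the standard discounted return as a special case.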

Unsupervised Reinforcement Learning in Multiple Environments

This work fosters an exploration strategy that is sensitive to the most adverse cases within the class of environments, and presents a policy gradient algorithm, alphaMEPOL, to optimize the introduced objective through mediated interactions with the class.

Interesting Object, Curious Agent: Learning Task-Agnostic Exploration

This paper evaluates several baseline exploration strategies and presents a simple yet effective approach to learning task-agnostic exploration policies, showing that the formulation is effective and provides the most consistent exploration across several training-testing environment pairs.

Learning to Learn End-to-End Goal-Oriented Dialog From Related Dialog Tasks

This work describes a meta-learning based method that selectively learns from the related dialog task data, which leads to significant accuracy improvements in an example dialog task.

On Learning Intrinsic Rewards for Policy Gradient Methods

This paper derives a novel algorithm for learning intrinsic rewards for policy-gradient-based learning agents, and compares the performance of an augmented agent that uses this algorithm to provide additive intrinsic rewards to an A2C-based policy learner and a PPO-based policy learner against a baseline agent that uses the same policy learners but only extrinsic rewards.
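The additive setup can be sketched as follows: the policy learner optimizes returns on r_ext + r_int, where r_int(s; η) is a learned intrinsic reward whose parameters η are meta-learned so that the policy's extrinsic performance improves. Only the inner combination is shown here, with a hypothetical linear intrinsic reward standing in for the learned one.

```python
# Inner loop of an additive intrinsic-reward agent: discounted returns are
# computed on the combined stream r_ext + r_int(s; eta). The meta-update
# of eta (not shown) would adjust it to maximize extrinsic return.

GAMMA = 0.99

def intrinsic_reward(state_features, eta):
    # Hypothetical parametric intrinsic reward: linear in state features.
    return sum(w * f for w, f in zip(eta, state_features))

def combined_returns(trajectory, eta):
    # trajectory: list of (state_features, extrinsic_reward) pairs.
    returns, g = [], 0.0
    for feats, r_ext in reversed(trajectory):
        g = r_ext + intrinsic_reward(feats, eta) + GAMMA * g
        returns.append(g)
    returns.reverse()
    return returns

traj = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)]
eta = [0.1, -0.05]
rets = combined_returns(traj, eta)  # [0.1 + 0.99 * 0.95, 1.0 - 0.05]
```

These combined returns would feed the policy-gradient update exactly as ordinary returns do; only the reward stream changes.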

Diversity is All You Need: Learning Skills without a Reward Function

The proposed DIAYN ("Diversity is All You Need"), a method for learning useful skills without a reward function, learns skills by maximizing an information theoretic objective using a maximum entropy policy.
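The DIAYN pseudo-reward has a compact form: r(s, z) = log q(z | s) − log p(z), where q is a learned skill discriminator and p(z) is a fixed prior over skills. The sketch below uses a hypothetical lookup-style discriminator purely to illustrate the reward computation; in the actual method q is a trained network.

```python
import math

# DIAYN-style pseudo-reward: r(s, z) = log q(z | s) - log p(z).
# States that the discriminator confidently attributes to skill z earn
# a reward above zero (more identifiable than the prior predicts).

NUM_SKILLS = 4
LOG_PRIOR = math.log(1.0 / NUM_SKILLS)  # uniform prior p(z)

def discriminator(state):
    # Hypothetical q(z | s): a softmax favoring the skill whose
    # "region" (at 0, 2, 4, 6 on a 1-D state axis) is nearest.
    logits = [-abs(state - 2.0 * k) for k in range(NUM_SKILLS)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def diayn_reward(state, skill):
    return math.log(discriminator(state)[skill]) - LOG_PRIOR

r_match = diayn_reward(2.0, 1)     # state in skill 1's region: positive
r_mismatch = diayn_reward(2.0, 3)  # wrong skill: negative
```

Maximizing this reward with a maximum-entropy policy pushes skills toward visiting distinguishable states, which is the information-theoretic objective the summary refers to.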

Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping

Conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy are investigated to shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent.
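The central result is that potential-based shaping, F(s, s') = γΦ(s') − Φ(s) for any potential function Φ, preserves the optimal policy. A minimal sketch, using a hypothetical distance-to-goal potential on a 1-D chain task:

```python
# Potential-based reward shaping: F(s, s') = GAMMA * phi(s') - phi(s).
# Adding F to the extrinsic reward leaves the optimal policy unchanged.

GAMMA = 0.9

def phi(state):
    # Hypothetical potential: negative distance to a goal at state 10.
    return -abs(10 - state)

def shaped_reward(state, next_state, extrinsic_reward):
    return extrinsic_reward + GAMMA * phi(next_state) - phi(state)

# Moving toward the goal yields a positive shaping bonus even before any
# extrinsic reward arrives; moving away yields a penalty.
bonus_toward = shaped_reward(3, 4, 0.0)  # phi rises from -7 to -6
bonus_away = shaped_reward(3, 2, 0.0)    # phi falls from -7 to -8
```

Along any full trajectory the shaping terms telescope, which is why the transformation cannot change which policy is optimal, only how quickly it is found.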

Reward Design via Online Gradient Ascent

This work develops a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent's lifetime and demonstrates its ability to improve reward functions in agents with various forms of limitations.

Deep Learning for Reward Design to Improve Monte Carlo Tree Search in ATARI Games

An adaptation of PGRD (policy-gradient for reward design) for learning a reward-bonus function to improve UCT (an MCTS algorithm), which improves UCT's performance on multiple ATARI games compared to UCT without the reward bonus.

Meta-Gradient Reinforcement Learning

A gradient-based meta-learning algorithm is discussed that is able to adapt the nature of the return online, whilst interacting with and learning from the environment, achieving a new state-of-the-art performance.

InfoBot: Transfer and Exploration via the Information Bottleneck

This work proposes to learn about decision states from prior experience by training a goal-conditioned policy with an information bottleneck, and finds that this simple mechanism effectively identifies decision states, even in partially observed settings.

Unsupervised Meta-Learning for Reinforcement Learning

The experimental results indicate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design and these procedures exceed the performance of learning from scratch.

Human-level control through deep reinforcement learning

This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective

A new optimal reward framework is defined that captures the pressure to design good primary reward functions that lead to evolutionary success across environments and shows that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation.