Optimizing agent behavior over long time scales by transporting value

@article{Hung2019OptimizingAB,
  title={Optimizing agent behavior over long time scales by transporting value},
  author={Chia-Chun Hung and Timothy P. Lillicrap and Josh Abramson and Yan Wu and Mehdi Mirza and Federico Carnevale and Arun Ahuja and Greg Wayne},
  journal={Nature Communications},
  year={2019},
  volume={10}
}
Humans prolifically engage in mental time travel. We dwell on past actions and experience satisfaction or regret. More than storytelling, these recollections change how we act in the future and endow us with a computationally important ability to link actions and consequences across spans of time, which helps address the problem of long-term credit assignment: the question of how to evaluate the utility of actions within a long-duration behavioral sequence. Existing approaches to credit… 
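The paper's mechanism, Temporal Value Transport, can be caricatured in a few lines: when a memory read at time t' attends strongly to an event stored at an earlier time t, the value estimate available at t' is added to the reward at t, splicing credit across the intervening span. Below is a rough numpy sketch; the threshold `read_strength_min` and scale `alpha` are illustrative stand-ins, not the paper's exact hyperparameters.

```python
import numpy as np

def transport_value(rewards, values, read_weights, alpha=0.9,
                    read_strength_min=0.5):
    """Temporal Value Transport (rough sketch).

    rewards:      (T,) per-step environment rewards
    values:       (T,) the critic's value estimates V(s_t)
    read_weights: (T, T) read_weights[t_read, t_write] = attention that
                  the memory read at t_read paid to the event stored at
                  t_write
    When a read attends strongly to a past event, the value available at
    the read time is spliced back into the reward at the remembered step.
    """
    augmented = np.array(rewards, dtype=float)
    T = len(augmented)
    for t_read in range(T):
        for t_write in range(t_read):
            w = read_weights[t_read, t_write]
            if w > read_strength_min:
                augmented[t_write] += alpha * w * values[t_read]
    return augmented
```

The design choice worth noting: credit flows only along memory accesses, so the bridged span does not have to survive temporal discounting.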
Episodic Memory for Learning Subjective-Timescale Models
TLDR
This work devises a novel approach to learning a transition dynamics model based on the sequences of episodic memories that define the agent's subjective timescale, over which it learns world dynamics and over which future planning is performed.
Planning in complex environments requires reasoning over multi-step timescales. However, in model-based learning, an agent’s model is more commonly defined over transitions between consecutive…
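A minimal sketch of the idea, assuming per-step surprisal scores are already available from the model; all names here are hypothetical. High-surprisal steps are kept as episodic memories, and the dynamics model is trained on memory-to-memory pairs, so a single "transition" may span many environment steps:

```python
def subjective_transitions(observations, surprisal, threshold=1.0):
    """Build dynamics-model training pairs on the agent's subjective
    timescale (sketch). Steps whose surprisal exceeds a threshold are
    kept as episodic memories; the model is then trained on
    memory-to-memory transitions rather than consecutive env steps.
    """
    kept = [t for t, s in enumerate(surprisal) if s > threshold]
    return [(observations[i], observations[j])
            for i, j in zip(kept, kept[1:])]
```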
Towards mental time travel: a hierarchical memory for reinforcement learning agents
TLDR
Hierarchical Chunk Attention Memory improves agent sample efficiency, generalization, and generality (by solving tasks that previously required specialized architectures), and is a step towards agents that can learn, interact, and adapt in complex and temporally extended environments.
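The mechanism can be sketched as two-level attention: a coarse pass scores chunk summaries, and full attention runs only within the top-scoring chunks. A simplified numpy sketch, not the paper's architecture (which uses learned summaries inside transformer layers):

```python
import numpy as np

def hierarchical_chunk_attention(query, memory, chunk_size=8, top_chunks=2):
    """Two-level attention over chunked memory (sketch).

    memory: (T, d) stored states, with T a multiple of chunk_size.
    A coarse pass scores each chunk by its mean summary; fine attention
    then runs only within the best-scoring chunks.
    """
    T, d = memory.shape
    chunks = memory.reshape(-1, chunk_size, d)
    summaries = chunks.mean(axis=1)                  # (n_chunks, d)
    best = np.argsort(summaries @ query)[-top_chunks:]
    selected = chunks[best].reshape(-1, d)           # states in top chunks
    scores = selected @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ selected                        # attended readout
```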
Counterfactual Credit Assignment in Model-Free Reinforcement Learning
TLDR
This work adapts the notion of counterfactuals from causality theory to a model-free RL setup, proposes to use them as future-conditional baselines and critics in policy gradient algorithms, and develops a valid, practical variant with provably lower variance.
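The gist, in a hedged sketch: the policy-gradient baseline is allowed to condition on a learned statistic of the future, provided that statistic carries no information about the action whose advantage is being computed (otherwise the gradient becomes biased). The names below are hypothetical:

```python
import numpy as np

def future_conditional_advantages(returns, states, future_stats, baseline_fn):
    """Counterfactual-style advantages with a future-conditional baseline
    (sketch). future_stats[t] is a learned summary of what happened after
    step t; by construction it must be independent of the action taken
    at t, or the policy gradient loses validity.
    """
    return np.array([g - baseline_fn(s, phi)
                     for g, s, phi in zip(returns, states, future_stats)])
```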
Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When to Act
TLDR
This work proposes to augment the standard Markov decision process with a new mode of action: being lazy, which defers decision-making to a default policy. In the resulting formalism, named lazy-MDPs, agents obtain competitive performance in Atari games while taking control in only a limited subset of states.
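A lazy-MDP can be approximated as a thin wrapper around a Gymnasium environment with a discrete action space: one extra action defers to a default policy, and any other action pays a small control cost. The interface below is a sketch; `default_policy` and `control_cost` are illustrative, not the paper's exact formulation.

```python
import gymnasium as gym

class LazyMDP(gym.Wrapper):
    """Augment a discrete-action environment with a 'lazy' action (sketch).

    Choosing the extra action defers to `default_policy`; any other
    action takes control and pays `control_cost`.
    """

    def __init__(self, env, default_policy, control_cost=0.1):
        super().__init__(env)
        self.default_policy = default_policy
        self.control_cost = control_cost
        self.lazy_action = env.action_space.n        # index of new action
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        if action == self.lazy_action:
            action, cost = self.default_policy(self._last_obs), 0.0
        else:
            cost = self.control_cost
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._last_obs = obs
        return obs, reward - cost, terminated, truncated, info
```

States where the agent takes control despite the cost are then directly interpretable as the states where its choices matter.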
Learning What to Remember: Strategies for Selective External Memory in Online Reinforcement Learning Agents
TLDR
This thesis develops a novel method, called online policy gradient over a reservoir (OPGOR), for selecting what to remember from the stream of observations, and explores a number of alternative methods for handling this selective-memory problem.
The act of remembering: a study in partially observable reinforcement learning
TLDR
This paper studies a lightweight approach to tackle partial observability in RL by providing the agent with an external memory and additional actions to control what, if anything, is written to the memory.
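A minimal sketch of the setup, assuming a discrete action space: the observation is concatenated with a one-slot memory buffer, and one extra action copies the current observation into that buffer. The single slot and the no-op fallback are illustrative simplifications.

```python
import numpy as np
import gymnasium as gym

class ExternalMemoryWrapper(gym.Wrapper):
    """Give the agent a one-slot writable memory (sketch).

    An extra discrete action copies the current observation into the
    slot; the slot is appended to every observation. observation_space
    is left un-updated for brevity.
    """

    def __init__(self, env):
        super().__init__(env)
        self.memory = np.zeros_like(env.observation_space.sample(),
                                    dtype=np.float32)
        self.write_action = env.action_space.n       # index of new action
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._last_obs = None

    def _augment(self, obs):
        return np.concatenate([np.asarray(obs, np.float32).ravel(),
                               self.memory.ravel()])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.memory[:] = 0.0
        self._last_obs = obs
        return self._augment(obs), info

    def step(self, action):
        if action == self.write_action:
            self.memory = np.asarray(self._last_obs, np.float32).copy()
            action = 0        # assumed environment no-op while writing
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._last_obs = obs
        return self._augment(obs), reward, terminated, truncated, info
```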
Pairwise Weights for Temporal Credit Assignment
TLDR
This empirical paper explores heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, and the time interval between the two.
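In code, the idea replaces the fixed discount gamma**k with an arbitrary weight on each (action state, reward state, delay) triple; ordinary discounting falls out as a special case. A small sketch with a hypothetical `weight_fn`:

```python
import numpy as np

def pairwise_weighted_return(states, rewards, weight_fn):
    """Credit each reward to each earlier step with a pairwise weight
    (sketch). weight_fn(s_action, s_reward, k) generalizes gamma**k.
    """
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        for k in range(T - t):
            returns[t] += weight_fn(states[t], states[t + k], k) * rewards[t + k]
    return returns

# ordinary discounting recovered as a special case
gamma = 0.99
states = np.zeros(5)                         # placeholder states
rewards = np.array([0., 0., 0., 0., 1.])
print(pairwise_weighted_return(states, rewards,
                               lambda sa, sr, k: gamma ** k))
```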
Interval timing in deep reinforcement learning agents
TLDR
This work characterize the strategies developed by recurrent and feedforward agents, which both succeed at temporal reproduction using distinct mechanisms, some of which bear specific and intriguing similarities to biological systems.
Learning Guidance Rewards with Trajectory-space Smoothing
TLDR
It is shown that the guidance rewards have an intuitive interpretation and can be obtained without training any additional neural networks; used in a few popular algorithms (Q-learning, Actor-Critic, Distributional RL), they elucidate the benefit of the approach when the environmental rewards are sparse or delayed.
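The simplest instance is uniform trajectory-space smoothing: spread a trajectory's total return evenly over its steps and feed the result to a standard learner in place of the sparse environment reward. A sketch; the uniform spread is one choice of smoothing, not the paper's only variant:

```python
import numpy as np

def guidance_rewards(trajectory_rewards):
    """Uniform trajectory-space smoothing (sketch).

    Spread the trajectory's total return evenly across its steps; the
    smoothed rewards then replace sparse/delayed environment rewards in
    an off-the-shelf learner (Q-learning, actor-critic, ...).
    """
    r = np.asarray(trajectory_rewards, dtype=float)
    return np.full(len(r), r.sum() / len(r))
```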

References

Showing 1-10 of 64 references
Reinforcement Learning and Episodic Memory in Humans and Animals: An Integrative Framework
TLDR
It is suggested that the ubiquitous and diverse roles of memory in RL may function as part of an integrated learning system that can efficiently approximate value functions over complex state spaces, learn with very little data, and bridge long-term dependencies between actions and rewards.
Solving the credit assignment problem: explicit and implicit learning of action sequences with probabilistic outcomes
TLDR
The results show that credit assignment involves two processes: an explicit memory encoding process that requires memory rehearsals and an implicit reinforcement-learning process that propagates credits backwards to previous choices.
Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding
TLDR
This work studies a novel algorithm that back-propagates through only a few temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states.
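A sketch of the selection step only, assuming stored past hidden states: attention scores pick the top-k past states, and gradients would flow through just those skip connections (the backward pass itself is omitted here):

```python
import numpy as np

def sparse_backtrack(query, past_states, k=3):
    """Sparse attentive backtracking, selection step (sketch).

    past_states: (T, d) stored hidden states; query: (d,) current state.
    Only the top-k attention connections are kept, so backpropagation
    would touch a handful of past steps instead of the whole history.
    """
    scores = past_states @ query
    top = np.argsort(scores)[-k:]                    # steps credit flows to
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()
```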
Unsupervised Predictive Memory in a Goal-Directed Agent
TLDR
A model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling, demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.
Delaying execution of intentions: overcoming the costs of interruptions
In real-world settings, execution of retrieved intentions must often be briefly delayed until an ongoing activity is completed (delayed-execute prospective memory tasks). Further, in demanding work…
Hippocampal Contributions to Control: The Third Way
TLDR
This work argues for the normative appropriateness of an additional, but so far marginalized, control system associated with episodic memory and involving the hippocampus and medial temporal cortices, and interprets data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis.
RUDDER: Return Decomposition for Delayed Rewards
TLDR
RUDDER aims at making the expected future reward zero, which simplifies Q-value estimation to computing the mean of the immediate reward, and performs return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels.
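One way to sketch the redistribution: if a sequence model predicts the episode return from each prefix, the redistributed reward at step t is the change in that prediction, so the new rewards sum to the predicted return and the expected future reward after each step is pushed toward zero. The prefix-prediction interface is an assumption for illustration:

```python
import numpy as np

def redistribute_return(prefix_predictions):
    """RUDDER-style reward redistribution (sketch).

    prefix_predictions[t] is a sequence model's prediction of the final
    episode return given the state-action prefix up to step t. The
    redistributed reward is the stepwise change in that prediction, so
    each step is credited with its contribution to the final return.
    """
    preds = np.asarray(prefix_predictions, dtype=float)
    rewards = np.empty_like(preds)
    rewards[0] = preds[0]
    rewards[1:] = preds[1:] - preds[:-1]
    return rewards        # sums to preds[-1], the predicted episode return
```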
Human-level control through deep reinforcement learning
TLDR
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.