• Corpus ID: 234469710

# Return-based Scaling: Yet Another Normalisation Trick for Deep RL

@article{Schaul2021ReturnbasedSY,
  title={Return-based Scaling: Yet Another Normalisation Trick for Deep RL},
  author={Tom Schaul and Georg Ostrovski and Iurii Kemaev and Diana Borsa},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.05347}
}
• Published 11 May 2021
• Computer Science
• ArXiv
Scaling issues are mundane yet irritating for practitioners of reinforcement learning. Error scales vary across domains, tasks, and stages of learning, sometimes by many orders of magnitude. This can be detrimental to learning speed and stability, create interference between learning tasks, and necessitate substantial tuning. We revisit this topic for agents based on temporal-difference learning, sketch out some desiderata, and investigate scenarios where simple fixes fall short. The mechanism…
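The abstract's premise, TD errors whose scale drifts by orders of magnitude, can be made concrete with a small sketch. The class below divides TD errors by a running standard deviation of observed returns, tracked with Welford's online algorithm. This is a generic illustration of error scaling in the spirit of the discussion, not the paper's specific mechanism (which is truncated above); all names are illustrative.

```python
import math


class RunningScale:
    """Divide TD errors by a running standard deviation of observed
    returns so their magnitude stays roughly O(1) as scales drift.

    A generic illustration of error normalisation, not the paper's
    proposed mechanism; all names here are illustrative.
    """

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations (Welford's algorithm)
        self.eps = eps

    def update(self, g: float) -> None:
        """Fold one observed return into the running statistics."""
        self.count += 1
        delta = g - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (g - self.mean)

    @property
    def sigma(self) -> float:
        """Current scale estimate (population std, floored by eps)."""
        var = self._m2 / max(self.count, 1)
        return math.sqrt(var) + self.eps

    def scale(self, td_error: float) -> float:
        """Return the TD error divided by the current scale."""
        return td_error / self.sigma
```

Feeding returns that grow by orders of magnitude keeps the scaled errors in a comparable range, which is the practical problem the abstract points at.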
## Citations (4)

When should agents explore?
• Computer Science
ArXiv
• 2021
An initial study of mode-switching, non-monolithic exploration for RL, using two-mode exploration and switching at sub-episodic time-scales on Atari; practical algorithmic components are proposed that make the switching mechanism adaptive and robust.
Deep Reinforcement Learning at the Edge of the Statistical Precipice
• Computer Science
NeurIPS
• 2021
This paper argues that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without risking slowed progress in the field, and advocates reporting interval estimates of aggregate performance and using performance profiles to account for variability in results.
Dojo: A Benchmark for Large Scale Multi-Task Reinforcement Learning
This work introduces Dojo, a reinforcement learning environment intended as a benchmark for evaluating RL agents’ capabilities in the areas of multi-task learning, generalization, transfer learning, and curriculum learning and empirically demonstrates its suitability for the purpose of studying cross-task generalization.
The Phenomenon of Policy Churn
• Computer Science
ArXiv
• 2022
It is hypothesised that policy churn is a beneficial but overlooked form of implicit exploration that casts ε-greedy exploration in a fresh light, namely that ε-noise plays a much smaller role than expected.

## References

Showing 1–10 of 42 references
Adapting to Reward Progressivity via Spectral Reinforcement Learning
• Computer Science, Psychology
ICLR
• 2021
Spectral DQN is proposed, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found, and allows the training loss to be balanced so that it gives more even weighting across small and large reward regions.
Observe and Look Further: Achieving Consistent Performance on Atari
• Computer Science
ArXiv
• 2018
This paper proposes an algorithm that addresses three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently.
Learning values across many orders of magnitude
• Computer Science
NIPS
• 2016
This work proposes to adaptively normalize the targets used in learning, useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when the policy of behavior changes.
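The adaptive target normalisation described in this reference (Pop-Art) can be sketched as follows: maintain running first and second moments of the targets, and whenever the normalisation statistics change, compensate in the output layer so the unnormalised predictions are unchanged. A minimal sketch; a scalar last layer stands in for a value network's head, and `beta` and all names are illustrative.

```python
import math


class PopArt:
    """Sketch of adaptive target normalisation in the spirit of
    Pop-Art: rescale targets adaptively (ART) while preserving the
    network's unnormalised outputs precisely (POP) by compensating
    in the last linear layer. Scalar weights for illustration only.
    """

    def __init__(self, beta: float = 0.01):
        self.mu = 0.0   # running mean of targets
        self.nu = 1.0   # running second moment of targets
        self.beta = beta
        self.w = 1.0    # last-layer weight (scalar stand-in)
        self.b = 0.0    # last-layer bias

    @property
    def sigma(self) -> float:
        """Std of targets implied by the running moments (floored)."""
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, target: float) -> None:
        """Fold one target in, then compensate (w, b) so that
        unnormalized(h) is unchanged for every input h."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        self.w = self.w * old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def unnormalized(self, h: float) -> float:
        """Map the normalised head output back to target scale."""
        return self.sigma * (self.w * h + self.b) + self.mu
```

The compensation step is the key design choice: the statistics can track targets that change over time (e.g. as the behaviour policy improves) without the learned value estimates jumping when the scale is updated.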
Adapting Behaviour for Learning Progress
• Computer Science
ArXiv
• 2019
This work proposes to dynamically adapt data generation by using a non-stationary multi-armed bandit to optimize a proxy of learning progress, producing results comparable to per-task tuning at a fraction of the cost.
Agent57: Outperforming the Atari Human Benchmark
• Computer Science
ICML
• 2020
This work proposes Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games and trains a neural network which parameterizes a family of policies ranging from very exploratory to purely exploitative.
Reinforcement Learning with Unsupervised Auxiliary Tasks
• Computer Science
ICLR
• 2017
This paper significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional *Labyrinth* tasks achieves a mean speedup in learning of 10× while averaging 87% expert human performance.
Generalization and Regularization in DQN
• Computer Science
ArXiv
• 2018
Despite regularization being largely underutilized in deep RL, it is shown that it can, in fact, help DQN learn more general features, which can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
• Computer Science
ICML
• 2018
A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.
Multi-task Deep Reinforcement Learning with PopArt
• Computer Science
AAAI
• 2019
This work proposes to automatically adapt the contribution of each task to the agent’s updates, so that all tasks have a similar impact on the learning dynamics, and learns a single trained policy that exceeds median human performance on this multi-task domain.
Natural Value Approximators: Learning when to Trust Past Estimates
• Computer Science
NIPS
• 2017
This work proposes a mechanism that learns an interpolation between a direct value estimate and a projected value estimate computed from the encountered reward and the previous estimate, which reduces the need to learn about discontinuities, and thus improves the value function approximation.