Corpus ID: 234469710

Return-based Scaling: Yet Another Normalisation Trick for Deep RL

@article{Schaul2021ReturnbasedSY,
  title={Return-based Scaling: Yet Another Normalisation Trick for Deep RL},
  author={Tom Schaul and Georg Ostrovski and Iurii Kemaev and Diana Borsa},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.05347}
}
Scaling issues are mundane yet irritating for practitioners of reinforcement learning. Error scales vary across domains, tasks, and stages of learning; sometimes by many orders of magnitude. This can be detrimental to learning speed and stability, create interference between learning tasks, and necessitate substantial tuning. We revisit this topic for agents based on temporal-difference learning, sketch out some desiderata and investigate scenarios where simple fixes fall short. The mechanism… 
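The abstract is cut off before the mechanism, so the following is only an illustrative sketch of the general idea of scaling temporal-difference errors by running return statistics; the class and function names are invented for this example and this is not necessarily the paper's exact statistic.

```python
import numpy as np

class RunningMoments:
    """Tracks running first and second moments with an exponential moving average."""
    def __init__(self, decay=0.999, eps=1e-8):
        self.decay, self.eps = decay, eps
        self.mean, self.second = 0.0, 0.0

    def update(self, x):
        x = float(x)
        self.mean = self.decay * self.mean + (1 - self.decay) * x
        self.second = self.decay * self.second + (1 - self.decay) * x * x

    @property
    def std(self):
        var = max(self.second - self.mean ** 2, 0.0)
        return np.sqrt(var) + self.eps


def scaled_td_error(reward, gamma, v_s, v_next, return_stats):
    """TD error divided by a running scale derived from observed returns.

    `return_stats` is a RunningMoments over episode returns (or rewards);
    this is an illustrative normaliser, not the exact statistic from the paper.
    """
    delta = reward + gamma * v_next - v_s
    return delta / return_stats.std
```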
When should agents explore?
TLDR
An initial study of mode-switching, non-monolithic exploration for RL that uses two-mode exploration with switching at sub-episodic time-scales on Atari, and proposes practical algorithmic components that make the switching mechanism adaptive and robust.
Deep Reinforcement Learning at the Edge of the Statistical Precipice
TLDR
This paper argues that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without risking slowed progress in the field, and advocates reporting interval estimates of aggregate performance and using performance profiles to account for the variability in results.
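As a concrete illustration of the kind of interval estimate advocated above, here is a minimal percentile-bootstrap sketch over per-run scores with the interquartile mean as the aggregate statistic; this is a generic example, not the authors' released tooling.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of values (drops the lowest and highest quartiles)."""
    s = np.sort(np.asarray(scores).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def bootstrap_ci(scores, statistic=interquartile_mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval over independent runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    stats = [statistic(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return statistic(scores), (lo, hi)

# Example: five runs of normalised final scores on one task.
point, (low, high) = bootstrap_ci([0.8, 1.1, 0.4, 0.9, 1.3])
```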
Dojo: A Benchmark for Large Scale Multi-Task Reinforcement Learning
TLDR
This work introduces Dojo, a reinforcement learning environment intended as a benchmark for evaluating RL agents’ capabilities in multi-task learning, generalization, transfer learning, and curriculum learning, and empirically demonstrates its suitability for studying cross-task generalization.
The Phenomenon of Policy Churn
TLDR
It is hypothesised that policy churn is a beneficial but overlooked form of implicit exploration that casts ε-greedy exploration in a fresh light, namely that ε-noise plays a much smaller role than expected.
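A hedged sketch of one natural way to quantify policy churn, assuming it is measured as the fraction of held-out states whose greedy action changes across a single learning update; the exact protocol in the paper may differ.

```python
import numpy as np

def greedy_actions(q_values):
    """Greedy action per state from a [num_states, num_actions] array of Q-values."""
    return np.argmax(q_values, axis=1)

def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action changed after one learning update."""
    return float(np.mean(greedy_actions(q_before) != greedy_actions(q_after)))

# Example: evaluate churn on a fixed batch of held-out states.
q_old = np.random.randn(128, 18)   # e.g. an Atari action set of 18
q_new = q_old + 0.1 * np.random.randn(128, 18)
print(policy_churn(q_old, q_new))
```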

References

Showing 1–10 of 42 references
Adapting to Reward Progressivity via Spectral Reinforcement Learning
TLDR
Spectral DQN is proposed, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found, and allows the training loss to be balanced so that it gives more even weighting across small and large reward regions.
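One plausible realisation of the decomposition described in the TLDR: a non-negative reward is split into exponentially sized buckets that sum back to the original value, so the higher components only activate for large rewards. The base, number of components, and thresholds here are assumptions, not the paper's exact scheme.

```python
import numpy as np

def decompose_reward(r, base=10.0, num_freqs=5):
    """Split a non-negative reward into components that sum back to r.

    Component i covers the range (base**(i-1), base**i]; higher components
    stay at zero until the reward is large enough to reach them.
    """
    thresholds = np.concatenate([[0.0], base ** np.arange(num_freqs)])
    widths = np.diff(thresholds)
    return np.clip(r - thresholds[:-1], 0.0, widths)

parts = decompose_reward(37.0)           # array([ 1.,  9., 27.,  0.,  0.])
assert np.isclose(parts.sum(), 37.0)
```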
Observe and Look Further: Achieving Consistent Performance on Atari
TLDR
This paper proposes an algorithm that addresses three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently.
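One widely reused ingredient from this line of work for handling diverse reward distributions is a squashing transform applied to value targets together with its exact inverse. The sketch below assumes the usual form with a small linear term (epsilon = 1e-2) and is a hedged illustration rather than a verbatim reproduction of the paper's algorithm.

```python
import numpy as np

EPS = 1e-2  # small linear term keeps the transform invertible for large values

def h(z):
    """Squash targets: roughly sqrt-scales large magnitudes, ~identity near zero."""
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + EPS * z

def h_inv(z):
    """Exact inverse of h, used to map squashed value estimates back."""
    return np.sign(z) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(z) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )

def transformed_target(reward, gamma, q_next_max):
    """Transformed TD target: un-squash the bootstrap value, build the target, learn h(Q)."""
    return h(reward + gamma * h_inv(q_next_max))
```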
Learning values across many orders of magnitude
TLDR
This work proposes to adaptively normalize the targets used in learning, which is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time as the behaviour policy changes.
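A minimal sketch of adaptive target normalisation in the spirit of this work: a linear value head keeps running moments of its targets and rescales its weights and bias whenever the statistics change, so that unnormalised predictions are preserved. The hyperparameters and single linear head are simplifications for illustration.

```python
import numpy as np

class AdaptiveNormHead:
    """Linear value head trained on adaptively normalised targets (a minimal sketch)."""
    def __init__(self, dim, beta=1e-3, eps=1e-4):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.mu, self.nu = 0.0, 1.0   # running first and second moments of targets
        self.beta, self.eps = beta, eps

    @property
    def sigma(self):
        return max(np.sqrt(self.nu - self.mu ** 2), self.eps)

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        # Rescale the head so unnormalised outputs are preserved under the new stats.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def value(self, features):
        """Unnormalised value prediction."""
        return self.sigma * (self.w @ features + self.b) + self.mu

    def normalised_target(self, target):
        """Target in the normalised space the head is trained against."""
        return (target - self.mu) / self.sigma
```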
Adapting Behaviour for Learning Progress
TLDR
This work proposes to dynamically adapt data generation by using a non-stationary multi-armed bandit to optimize a proxy for learning progress, producing results comparable to per-task tuning at a fraction of the cost.
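A hedged sketch of the bandit side of this idea: a generic discounted UCB bandit over behaviour configurations, fed a learning-progress proxy as its reward. It is not necessarily the bandit used in the paper; the arm semantics and hyperparameters are assumptions.

```python
import numpy as np

class DiscountedBandit:
    """Non-stationary multi-armed bandit over behaviour configurations (a sketch).

    Each arm is one behaviour modulation (e.g. an epsilon or temperature setting);
    the feedback is a proxy for learning progress. Discounting lets recent
    outcomes dominate, which handles non-stationarity in a simple way.
    """
    def __init__(self, num_arms, gamma=0.98, ucb_c=1.0):
        self.gamma, self.ucb_c = gamma, ucb_c
        self.value_sum = np.zeros(num_arms)
        self.count = np.zeros(num_arms)

    def select(self):
        total = self.count.sum() + 1e-8
        mean = self.value_sum / np.maximum(self.count, 1e-8)
        bonus = self.ucb_c * np.sqrt(np.log(total + 1.0) / np.maximum(self.count, 1e-8))
        return int(np.argmax(mean + bonus))

    def update(self, arm, learning_progress):
        self.value_sum *= self.gamma
        self.count *= self.gamma
        self.value_sum[arm] += learning_progress
        self.count[arm] += 1.0
```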
Agent57: Outperforming the Atari Human Benchmark
TLDR
This work proposes Agent57, the first deep RL agent to outperform the standard human benchmark on all 57 Atari games; it trains a neural network that parameterizes a family of policies ranging from very exploratory to purely exploitative.
Reinforcement Learning with Unsupervised Auxiliary Tasks
TLDR
This paper significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks achieves a mean speedup in learning of 10× while averaging 87% expert human performance.
Generalization and Regularization in DQN
TLDR
Despite regularization being largely underutilized in deep RL, it is shown that it can, in fact, help DQN learn more general features, which can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
TLDR
A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.
Multi-task Deep Reinforcement Learning with PopArt
TLDR
This work proposes to automatically adapt the contribution of each task to the agent’s updates, so that all tasks have a similar impact on the learning dynamics, and learns a single policy that exceeds median human performance on this multi-task domain.
Natural Value Approximators: Learning when to Trust Past Estimates
TLDR
This work proposes a mechanism that learns an interpolation between a direct value estimate and a projected value estimate computed from the encountered reward and the previous estimate, which reduces the need to learn about discontinuities, and thus improves the value function approximation.
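A minimal sketch of the interpolation described in the TLDR, under assumed reward-indexing and discounting conventions; in the paper the gate is learned, whereas here it is just a parameter.

```python
def natural_value(v_direct, v_prev, reward_prev, gamma, beta):
    """Blend a direct value estimate with a projection of the previous step's value.

    The projection assumes the previous value was accurate: if
    v_prev is close to reward_prev + gamma * V(s_t), then
    (v_prev - reward_prev) / gamma is another estimate of V(s_t).
    `beta` in [0, 1] controls how much to trust the projected estimate.
    """
    projected = (v_prev - reward_prev) / gamma
    return beta * projected + (1.0 - beta) * v_direct
```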
...