Publications
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
TLDR
We develop a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.
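The importance weighting in IMPALA's name refers to its V-trace off-policy correction, which reweights actor-generated trajectories before the learner consumes them. Below is a minimal single-trajectory sketch of the V-trace targets, assuming per-step log-probabilities under the behaviour policy μ and the learner's policy π, value estimates, rewards, and a bootstrap value; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for a single trajectory (backwards recursion)."""
    behaviour_logp, target_logp, rewards, values = map(
        lambda a: np.asarray(a, dtype=float),
        (behaviour_logp, target_logp, rewards, values))

    rhos = np.exp(target_logp - behaviour_logp)            # importance ratios pi / mu
    clipped_rhos = np.minimum(rho_bar, rhos)                # rho_t = min(rho_bar, pi/mu)
    clipped_cs = np.minimum(c_bar, rhos)                    # c_t   = min(c_bar,   pi/mu)

    next_values = np.append(values[1:], bootstrap_value)    # V(x_{t+1})
    deltas = clipped_rhos * (rewards + gamma * next_values - values)

    vs_minus_v = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(values))):                  # accumulate v_t - V(x_t)
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v                              # v-trace targets v_t
```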
A Distributional Perspective on Reinforcement Learning
TLDR
We argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent.
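The "value distribution" in this summary is the random return Z rather than its expectation Q. The distributional Bellman equation the paper builds on can be written as:

```latex
Z(x, a) \;\overset{D}{=}\; R(x, a) + \gamma\, Z(X', A'),
\qquad X' \sim P(\cdot \mid x, a),\; A' \sim \pi(\cdot \mid X'),
\qquad Q(x, a) = \mathbb{E}\big[Z(x, a)\big].
```

The equality holds in distribution, and taking expectations on both sides recovers the usual Bellman equation for Q.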
Unifying Count-Based Exploration and Intrinsic Motivation
TLDR
We use density models to measure uncertainty in non-tabular reinforcement learning, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model.
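The pseudo-count here is computed from two queries to the density model: the probability ρ(x) assigned to a state before observing it, and the "recoding" probability ρ'(x) assigned after the model is updated on that same state. A minimal sketch assuming those two probabilities are available; the bonus scale β and the smoothing constant in the bonus are illustrative defaults, not values taken from the paper's experiments.

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count derived from a density model.

    rho:       probability the model assigns to state x before observing it
    rho_prime: probability assigned to x after the model is updated on x
    """
    prediction_gain = rho_prime - rho
    if prediction_gain <= 0.0:
        return float("inf")            # the model learned nothing new about x
    return rho * (1.0 - rho_prime) / prediction_gain

def exploration_bonus(rho, rho_prime, beta=0.05):
    """Count-based intrinsic reward of the form beta / sqrt(N_hat + 0.01)."""
    return beta / (pseudo_count(rho, rho_prime) + 0.01) ** 0.5
```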
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
TLDR
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning.
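BYOL trains an online network to predict a target network's projection of another augmented view of the same image, with the target weights following an exponential moving average of the online weights and no negative pairs. A minimal sketch of the two core pieces, assuming the encoder/projector/predictor outputs are already computed as vectors; in the paper the loss is additionally symmetrised over the two views.

```python
import numpy as np

def byol_loss(p_online, z_target):
    """Mean squared error between L2-normalised vectors (= 2 - 2 * cosine similarity).

    p_online: online predictor output for one augmented view, shape (batch, dim)
    z_target: target-network projection of the other view,    shape (batch, dim)
    """
    p = p_online / np.linalg.norm(p_online, axis=-1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=-1, keepdims=True)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

def ema_update(target_params, online_params, tau=0.996):
    """Target network = exponential moving average of the online network (no gradients)."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```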
Minimax Regret Bounds for Reinforcement Learning
TLDR
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs.
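"Provably optimal exploration" is measured here by cumulative regret over K episodes of horizon H. A worked statement of the quantity being bounded, with the paper's headline rate quoted only up to logarithmic factors (S states, A actions, T = KH total steps; constants and the precise regime for T are in the paper):

```latex
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big( V^{*}_{1}(x_{k,1}) - V^{\pi_k}_{1}(x_{k,1}) \Big)
\;\le\; \widetilde{O}\!\left(\sqrt{H S A T}\right) \quad \text{for sufficiently large } T,
```

which matches the known lower bound up to logarithmic factors.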
Learning to reinforcement learn
TLDR
In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains.
Noisy Networks for Exploration
TLDR
We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration.
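"Parametric noise added to its weights" means each weight takes the form w = μ + σ·ε, where μ and σ are learned and ε is resampled noise, so the amount of exploration is itself adapted by gradient descent. A minimal sketch of such a noisy linear layer using independent Gaussian noise (the paper also describes a cheaper factorised variant); the initial σ value and the layer interface here are illustrative.

```python
import numpy as np

class NoisyLinear:
    """Linear layer with learnable noisy weights: w = mu_w + sigma_w * eps_w."""

    def __init__(self, in_dim, out_dim, sigma_init=0.017, rng=None):
        self.rng = rng or np.random.default_rng()
        bound = 1.0 / np.sqrt(in_dim)
        self.mu_w = self.rng.uniform(-bound, bound, size=(out_dim, in_dim))
        self.mu_b = self.rng.uniform(-bound, bound, size=out_dim)
        self.sigma_w = np.full((out_dim, in_dim), sigma_init)   # learned alongside mu in practice
        self.sigma_b = np.full(out_dim, sigma_init)

    def __call__(self, x):
        eps_w = self.rng.standard_normal(self.mu_w.shape)        # fresh noise each forward pass
        eps_b = self.rng.standard_normal(self.mu_b.shape)
        return (self.mu_w + self.sigma_w * eps_w) @ x + (self.mu_b + self.sigma_b * eps_b)
```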
Kullback–Leibler upper confidence bounds for optimal sequential allocation
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm rewards computed using the Kullback–Leibler divergence.
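For Bernoulli rewards, the index policy described here assigns to each arm the largest mean q that remains statistically plausible given the arm's empirical mean, with plausibility measured by the Kullback–Leibler divergence; at each round the arm with the largest index is pulled. A minimal sketch assuming Bernoulli arms and solving for the index by bisection; the exploration term log t + c·log log t follows the paper's analysis, but the constant c and the tolerance are illustrative choices.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(empirical_mean, pulls, t, c=3.0, tol=1e-6):
    """Largest q >= empirical_mean with pulls * kl(mean, q) <= log t + c * log log t."""
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = empirical_mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pulls * bernoulli_kl(empirical_mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```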
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
TLDR
This paper considers a variant of the basic algorithm for the stochastic multi-armed bandit problem that takes into account the empirical variance of the different arms.
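Taking the empirical variance into account means replacing the usual UCB1 bonus with an empirical-Bernstein-style bonus, so arms with low observed variance are explored less. Up to the paper's exact constants and choice of exploration function, the index after s pulls of an arm at time t has the form (b an upper bound on the reward range):

```latex
B_{s,t} \;=\; \overline{X}_{s} \;+\; \sqrt{\frac{2\,\overline{V}_{s}\,\log t}{s}} \;+\; \frac{3\, b \,\log t}{s},
```

where X̄_s and V̄_s are the empirical mean and empirical variance of the rewards observed from that arm.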
Finite-Time Bounds for Fitted Value Iteration
TLDR
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes.
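Sampling-based fitted value iteration alternates an empirical Bellman backup computed from sampled transitions with a regression (projection) step onto a function class. A compact sketch under assumed interfaces: `sample_next(x, a)` draws a reward and next state from a generative model, and `regressor_factory()` returns any object with scikit-learn-style `fit`/`predict`; these names are mine, not the paper's.

```python
def fitted_value_iteration(states, actions, sample_next, regressor_factory,
                           gamma=0.99, n_iters=50, n_samples=10):
    """Fitted value iteration on a fixed set of sampled states (sketch)."""
    value_of = lambda xs: [0.0 for _ in xs]                 # V_0 = 0
    for _ in range(n_iters):
        targets = []
        for x in states:
            backups = []
            for a in actions:
                draws = [sample_next(x, a) for _ in range(n_samples)]
                backups.append(sum(r + gamma * value_of([x2])[0] for r, x2 in draws)
                               / n_samples)                 # Monte Carlo Bellman backup
            targets.append(max(backups))                    # greedy over actions
        regressor = regressor_factory()
        regressor.fit(states, targets)                      # projection onto the function class
        value_of = regressor.predict
    return value_of                                         # final value-function approximation
```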