
Podracer architectures for scalable Reinforcement Learning

Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, Hado van Hasselt
Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use… 
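The data-parallel pattern the abstract alludes to (shard a batch across devices, compute per-shard gradients, then all-reduce) can be caricatured in plain numpy. This is a hedged sketch of the idea, not the actual TensorFlow/PyTorch/JAX API; `loss_grad` and `data_parallel_grad` are illustrative names.

```python
import numpy as np

def loss_grad(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    err = x @ w - y
    return 2.0 * x.T @ err / len(x)

def data_parallel_grad(w, x, y, n_devices=4):
    """Mimic pmap-style data parallelism: split the batch into shards,
    compute per-shard gradients (conceptually, one per device), then
    all-reduce by averaging. Assumes the batch divides evenly."""
    xs = np.array_split(x, n_devices)
    ys = np.array_split(y, n_devices)
    grads = [loss_grad(w, xi, yi) for xi, yi in zip(xs, ys)]
    return np.mean(grads, axis=0)
```

With equal-sized shards, the averaged shard gradients coincide exactly with the full-batch gradient, which is why frameworks can make the device split transparent to the user.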


ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning
A scalable and elastic library, ElegantRL-Podracer, for cloud-native deep reinforcement learning that efficiently supports millions of GPU cores to carry out massively parallel training at multiple levels, and substantially outperforms RLlib.
FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance
An RLOps in finance paradigm is proposed and a FinRL-Podracer framework is presented to accelerate the development pipeline of deep reinforcement learning (DRL)-driven trading strategy and to improve both trading performance and training efficiency.
CoBERL: Contrastive BERT for Reinforcement Learning
This work proposes Contrastive BERT for RL (CoBERL), an agent that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency.
Emphatic Algorithms for Deep Reinforcement Learning
The use of emphatic methods is extended to multi-step deep RL learning targets, including an off-policy value-learning method known as 'V-trace' (Espeholt et al., 2018) that is often used in actor-critic systems.
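The V-trace targets mentioned here admit a compact backward recursion, v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})), with clipped importance ratios rho_t = min(rho_bar, pi/mu) and c_t = min(c_bar, pi/mu). The sketch below follows Espeholt et al. (2018); the function name and array layout are assumptions for illustration.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, ratios, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for a length-T trajectory.

    rewards, ratios: length-T arrays (ratios are pi/mu at each step);
    values: length-T array V(x_t); bootstrap: scalar V(x_T).
    """
    T = len(rewards)
    rho = np.minimum(rho_bar, ratios)   # clipped delta weights
    c = np.minimum(c_bar, ratios)       # clipped trace cutoffs
    next_values = np.append(values[1:], bootstrap)
    deltas = rho * (rewards + gamma * next_values - values)
    vs = np.zeros(T)
    acc = 0.0
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * c[t] * acc
        vs[t] = values[t] + acc
    return vs
```

A useful sanity check: on-policy (all ratios equal to 1), the correction terms telescope and v_s reduces to the ordinary n-step bootstrapped return.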
Proper Value Equivalence
A loss function is constructed for learning PVE models and it is argued that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss.
Muesli: Combining Improvements in Policy Optimization
A novel policy update that combines regularized policy optimization with model learning as an auxiliary loss and does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines.
Self-Consistent Models and Values
This work investigates augmenting model-based RL by additionally encouraging a learned model and value function to be jointly self-consistent, and finds that, with appropriate choices, self-consistency helps both policy evaluation and control.


SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference
A modern scalable reinforcement learning agent called SEED (Scalable, Efficient Deep-RL), which can train on millions of frames per second with a simple architecture, lowering the cost of experiments compared to previous methods.
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.
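The actor-learner decoupling that IMPALA popularized can be caricatured with a thread-safe queue: many actors produce trajectories, one learner consumes them in a tight loop. This is a toy single-process sketch with invented names, not the paper's distributed implementation.

```python
import queue
import threading

def run_actors_and_learner(n_actors=4, steps_per_actor=10):
    """Toy actor-learner split: actors push trajectory items onto a
    bounded queue; a single learner drains it. In the real system the
    actors run on many machines with a possibly stale behaviour policy,
    and the learner applies importance-corrected (V-trace) updates."""
    trajectories = queue.Queue(maxsize=64)

    def actor(actor_id):
        for step in range(steps_per_actor):
            # Stands in for environment interaction producing a trajectory.
            trajectories.put((actor_id, step))

    threads = [threading.Thread(target=actor, args=(i,)) for i in range(n_actors)]
    for t in threads:
        t.start()

    consumed = 0
    total = n_actors * steps_per_actor
    while consumed < total:
        trajectories.get()  # the learner would compute gradients here
        consumed += 1

    for t in threads:
        t.join()
    return consumed
```

The bounded queue is the key design choice: it applies back-pressure so fast actors cannot run arbitrarily far ahead of the learner, which bounds policy lag.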
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Asynchronous Methods for Deep Reinforcement Learning
A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Massively Parallel Methods for Deep Reinforcement Learning
This work presents the first massively distributed architecture for deep reinforcement learning, using a distributed neural network to represent the value function or behaviour policy, and a distributed store of experience to implement the Deep Q-Network algorithm.
Distributed Prioritized Experience Replay
This work proposes a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible, and substantially improves the state of the art on the Arcade Learning Environment.
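The proportional sampling scheme at the heart of prioritized replay can be sketched in a few lines: draw index i with probability p_i^alpha / sum_k p_k^alpha and weight it by the importance-sampling correction (N * P(i))^(-beta), normalized by the maximum weight. A hedged numpy sketch under assumed names follows, not the paper's distributed sum-tree implementation.

```python
import numpy as np

def prioritized_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Proportional prioritized sampling with importance-sampling weights.

    priorities: per-transition priorities (e.g. absolute TD errors);
    alpha controls how strongly prioritization is applied, beta how
    strongly the induced bias is corrected.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize so updates are only scaled down
    return idx, weights
```

In practice a sum-tree replaces the O(N) normalization here so that sampling and priority updates stay O(log N) at replay-buffer scale.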
In-datacenter performance analysis of a tensor processing unit
N. Jouppi, C. Young, et al. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, contemporaries deployed in the same datacenters.
Meta-Gradient Reinforcement Learning with an Objective Discovered Online
This work proposes an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment, and adapts over time to learn with greater efficiency.
RLlib: Abstractions for Distributed Reinforcement Learning
This work argues for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks, through RLlib: a library that provides scalable software primitives for RL.
Rainbow: Combining Improvements in Deep Reinforcement Learning
This paper examines six extensions to the DQN algorithm and empirically studies their combination, showing that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance.
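One of the six extensions Rainbow combines, double Q-learning, fits in a few lines: the online network selects the argmax action at the next state and the target network evaluates it, which reduces the overestimation bias of plain Q-learning targets. A hedged numpy sketch with illustrative names:

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN bootstrap targets for a batch of transitions.

    q_online_next, q_target_next: [batch, n_actions] Q-values at s';
    dones: 1.0 where the episode terminated (no bootstrap), else 0.0.
    """
    best = np.argmax(q_online_next, axis=1)                  # online net selects
    bootstrap = q_target_next[np.arange(len(best)), best]    # target net evaluates
    return rewards + gamma * (1.0 - dones) * bootstrap
```

The other Rainbow ingredients (prioritized replay, dueling heads, multi-step targets, distributional values, noisy nets) compose around this same target computation.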