Corpus ID: 59553459

TF-Replicator: Distributed Machine Learning for Researchers

by Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gomez Colmenarejo, Aedan Pope, Fabio Viola and Dan Belov
We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF… 
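The synchronous data-parallel regime the abstract describes can be sketched in a few lines of plain Python; everything below (the toy linear model, the shard/step function names) is illustrative, not TF-Replicator's actual API:

```python
# Sketch of one synchronous data-parallel SGD step: each replica computes
# gradients on its shard of the global batch, gradients are averaged (the
# all-reduce), and every replica applies the identical update.

def grad(w, batch):
    # Gradient of mean squared error for the toy linear model y = w * x.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sync_step(w, batch, num_replicas, lr=0.1):
    shards = [batch[r::num_replicas] for r in range(num_replicas)]
    grads = [grad(w, shard) for shard in shards]   # per-replica compute
    avg = sum(grads) / num_replicas                # all-reduce: average
    return w - lr * avg                            # same update everywhere

# Fit y = 3x across two replicas.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(50):
    w = sync_step(w, data, num_replicas=2)
print(round(w, 2))  # converges to 3.0
```

Because the averaged gradient is identical on every replica, all replicas stay in lockstep without a central parameter server; this is the property that lets the same model code run on one GPU or many.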


Stabilizing Transformers for Reinforcement Learning

The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture.

Data Movement Is All You Need: A Case Study of Transformer Networks

This work finds that data movement is the key bottleneck when training, and presents a recipe for globally optimizing data movement in transformers, applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.

The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism

This work presents scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks, and enables training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy.

Large Scale Adversarial Representation Learning

This work builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator, and demonstrates that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.

Data Movement Is All You Need: A Case Study on Optimizing Transformers

This work finds that data movement is the key bottleneck when training, and presents a recipe for globally optimizing data movement in transformers to achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT.

Two Routes to Scalable Credit Assignment without Weight Symmetry

This work investigates a recently proposed local learning rule that yields competitive performance with backpropagation, finds that it is highly sensitive to metaparameter choices, requiring laborious tuning that does not transfer across network architectures, and investigates several non-local learning rules that relax the need for instantaneous weight transport into a more biologically plausible "weight estimation" process.

Regularized Hierarchical Policies for Compositional Transfer in Robotics

This work develops and investigates simple hierarchical inductive biases -- in the form of structured policies -- as a mechanism for knowledge transfer across tasks in reinforcement learning (RL) and designs an RL algorithm that enables stable and fast learning.

Compositional Transfer in Hierarchical Reinforcement Learning

Regularized Hierarchical Policy Optimization (RHPO) is introduced to improve data-efficiency for domains with multiple dominant tasks and ultimately reduce required platform time; it demonstrates substantial data-efficiency and final performance gains over competitive baselines in a week-long physical robot stacking experiment.

Adversarial Video Generation on Complex Datasets

This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work.

Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review

This article focuses on surveying each of the four research directions, providing a comprehensive review of the state-of-the-art tools and techniques for efficient edge inference of deep neural networks.



TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Mesh-TensorFlow: Deep Learning for Supercomputers

Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
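The model-parallel half of that result rests on splitting a tensor dimension across devices. A pure-Python caricature (none of these names are Mesh-TensorFlow's) of splitting a matrix-vector product along its output dimension:

```python
# Toy model-parallel matrix-vector product: the weight matrix's columns
# (output units) are split into contiguous shards, one per "device"; each
# device computes its slice of x @ W, and the slices are concatenated.

def matvec(x, cols):
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in cols]

def model_parallel_matvec(x, cols, num_devices):
    k = len(cols) // num_devices                       # assume an even split
    shards = [cols[d * k:(d + 1) * k] for d in range(num_devices)]
    partials = [matvec(x, shard) for shard in shards]  # local compute per device
    return [v for part in partials for v in part]      # concatenate the slices

x = [1.0, 2.0]
cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # 4 output units
y = model_parallel_matvec(x, cols, num_devices=2)
print(y)  # [1.0, 2.0, 3.0, 0.0], identical to the serial matvec(x, cols)
```

Splitting the output dimension means no device ever materializes the full weight matrix, which is what lets models grow past single-accelerator memory.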

SparkNet: Training Deep Networks in Spark

This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
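The asynchrony behind Downpour SGD can be caricatured in one process: workers fetch a possibly stale parameter snapshot, compute a gradient, and push the update to a shared server without waiting for each other. A toy sketch (all names illustrative, and the sequential loop stands in for truly concurrent workers):

```python
# Toy Downpour-style asynchrony: a central parameter store that workers
# read and update without synchronizing, so gradients may be computed
# against stale parameters. Simulated sequentially in a single process.

class ParameterServer:
    def __init__(self, w):
        self.w = w

    def fetch(self):
        return self.w        # snapshot; possibly stale by the time of push

    def push(self, g, lr):
        self.w -= lr * g     # apply immediately, no barrier across workers

def worker_step(ps, batch, lr=0.05):
    w = ps.fetch()
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    ps.push(g, lr)

# Two "workers" with different data shards, interleaving steps to fit y = 2x.
ps = ParameterServer(0.0)
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
for step in range(100):
    worker_step(ps, shards[step % 2])
print(round(ps.w, 2))  # settles at 2.0
```

The absence of any barrier is what lets Downpour scale to tens of thousands of cores; the cost is that each update may be computed against out-of-date parameters.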

Project Adam: Building an Efficient and Scalable Deep Learning Training System

The design and implementation of a distributed system called Adam, built from commodity server machines to train large deep neural network models, is described; the system exhibits world-class performance, scaling, and task accuracy on visual recognition tasks and shows that task accuracy improves with larger models.

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

A new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) is developed that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

This work builds a highly scalable deep learning training system for dense GPU clusters with three main contributions: a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy, an optimization approach for extremely large mini-batch sizes that can train CNN models on the ImageNet dataset without loss of accuracy, and highly optimized all-reduce algorithms.

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.

GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server

GeePS enables a state-of-the-art single-node GPU implementation to scale well, e.g. to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.

Horovod: fast and easy distributed deep learning in TensorFlow

Horovod is an open-source library that addresses both obstacles to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
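The ring reduction mentioned here can be illustrated in plain Python: each worker's buffer is split into N chunks, N-1 reduce-scatter steps leave each worker with one fully summed chunk, and N-1 all-gather steps circulate the completed chunks. A simplified sequential sketch (a real implementation runs the steps concurrently and overlaps communication with computation; none of these names are Horovod's):

```python
# Simplified ring all-reduce over n workers. Each buffer is split into n
# chunks; n-1 reduce-scatter steps leave worker w holding the global sum
# of chunk (w+1) % n, and n-1 all-gather steps circulate those completed
# chunks until every worker holds the full summed vector.

def ring_allreduce(vectors):
    n = len(vectors)
    bufs = [list(v) for v in vectors]    # one local buffer per worker
    assert len(vectors[0]) % n == 0, "toy version: length must divide evenly"
    c = len(vectors[0]) // n             # chunk length

    def span(k):
        lo = (k % n) * c
        return range(lo, lo + c)

    for s in range(n - 1):               # reduce-scatter: accumulate sums
        for w in range(n):               # worker w sends chunk (w - s) to w+1
            for i in span(w - s):
                bufs[(w + 1) % n][i] += bufs[w][i]
    for s in range(n - 1):               # all-gather: spread completed chunks
        for w in range(n):               # worker w forwards chunk (w + 1 - s)
            for i in span(w + 1 - s):
                bufs[(w + 1) % n][i] = bufs[w][i]
    return bufs

grads = [[1.0] * 4, [2.0] * 4, [3.0] * 4, [4.0] * 4]
result = ring_allreduce(grads)
print(result[0])  # [10.0, 10.0, 10.0, 10.0] on every worker
```

Each worker sends and receives only its fair share of the data per step, so total bytes transferred per worker are independent of the number of workers; that bandwidth-optimality is why the ring pattern scales well for large gradient tensors.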