• Corpus ID: 3398835

Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev and Mike Del Balso
Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter… 


Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers
This paper describes CROSSBOW, a new single-server multi-GPU system for training deep learning models that lets users freely choose their preferred batch size, however small, while scaling to multiple GPUs. It also introduces SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent.
Workload-aware Automatic Parallelization for Multi-GPU DNN Training
This work proposes a workload-aware auto-parallelization framework (WAP) for DNN training, where the work is automatically distributed to multiple GPUs based on the workload characteristics, and shows competitive training throughput compared with the state-of-the-art frameworks.
Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
This work explores hybrid parallelization, where each data parallel worker comprises more than one device to accelerate each training step by exploiting model parallelism, and shows that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone.
Project CGX: Scalable Deep Learning on Commodity GPUs
This paper investigates whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and proposes a framework called CGX, which provides efficient software support for communication compression, and is able to remove communication bottlenecks from consumer-grade multi-GPU systems, in the absence of hardware support.
A Linear Algebraic Approach to Model Parallelism in Deep Learning
This work proposes a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN, and builds distributed DNN layers using these parallel primitives.
DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning
DeepReduce is a versatile framework for the compressed communication of sparse tensors, tailored for distributed deep learning. It transmits less data and imposes lower computational overhead than existing methods, without affecting the training accuracy.
Hydra: A Scalable and Optimized Data System for Large Multi-Model Deep Learning
  • Computer Science
  • 2021
This work devises a set of techniques to enable seamless training of very large DL models in both single-GPU and multi-GPU cases, and exploits a hitherto unexplored avenue for parallelism in this context, namely, multi-model execution such as during model selection.
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment
This work explains the difference in single-GPU training speed between TensorFlow and PyTorch and identifies key factors that affect performance, which can guide end users in writing their models more efficiently.
Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines
This project investigates a series of designs to improve pipeline flexibility and adaptivity, while also increasing performance, and shows that with the new flexible communication schemes, the CPU time spent during training is reduced by 2-11X, and the implementation can achieve up to 10X speedups when CPU core limits are imposed.
Benchmarking performance of RaySGD and Horovod for big data applications
This work focuses on two lightweight libraries for distributed deep learning, RaySGD and Horovod, which aim to alleviate the challenges of a distributed training setup by providing support for seamless parallelization.


TensorFlow: A system for large-scale machine learning
The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization and enable training visual recognition models on internet-scale data with high efficiency.
Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. It provides a stable platform for third-party research and enables the run-time composition of independent software add-ons.
Bandwidth optimal all-reduce algorithms for clusters of workstations
Message Passing Interface (MPI) Forum Home Page
  • http://www.mpi-forum.org
  • 2017
Meet Michelangelo: Uber’s machine learning platform
  • https://eng.uber.com/michelangelo/
  • 2017
Online; accessed 6-December-2017
  • Keras. https://github.com/fchollet/keras
  • 2015
$ mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
    baidu-research/tensorflow-allreduce. https://github.com/baidu-research/tensorflow-allreduce, 2017
    • [Online; accessed
    • 2017
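
The `mpirun` command listed among the references can be read as follows: `-np 16` asks for 16 processes in total, and `-H server1:4,server2:4,server3:4,server4:4` names four hosts with four slots each, so every slot runs one copy of `python train.py`. The sketch below only illustrates that arithmetic. The function names `parse_hosts` and `assign_ranks` are hypothetical, and filling each host's slots in order is an assumption about process placement, not a statement about how Open MPI is guaranteed to map ranks in every configuration.

```python
from typing import Dict, List, Tuple


def parse_hosts(spec: str) -> Dict[str, int]:
    """Parse a host list such as 'server1:4,server2:4' into {host: slots}."""
    hosts: Dict[str, int] = {}
    for entry in spec.split(","):
        name, _, slots = entry.partition(":")
        # Assumption for this sketch: an entry without ':N' counts as one slot.
        hosts[name] = int(slots) if slots else 1
    return hosts


def assign_ranks(hosts: Dict[str, int], num_procs: int) -> List[Tuple[int, str, int]]:
    """Return (global_rank, host, local_rank) tuples, filling hosts in order."""
    total_slots = sum(hosts.values())
    if num_procs > total_slots:
        raise ValueError(f"{num_procs} processes requested, only {total_slots} slots")
    placements: List[Tuple[int, str, int]] = []
    for host, slots in hosts.items():
        for local_rank in range(slots):
            if len(placements) == num_procs:
                return placements
            placements.append((len(placements), host, local_rank))
    return placements


if __name__ == "__main__":
    hosts = parse_hosts("server1:4,server2:4,server3:4,server4:4")
    for rank, host, local_rank in assign_ranks(hosts, 16):
        print(f"rank {rank:2d} -> {host} (local slot {local_rank})")
```

Under these assumptions, the four slots per host line up with four training processes per server, which is the usual arrangement when each process drives one GPU.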