Corpus ID: 235485466

Distributed Deep Learning in Open Collaborations

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko
Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups… 
1 Citation


Datasets: A Community Library for Natural Language Processing
Datasets is a community library for contemporary NLP designed to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves the same for small datasets as for internet-scale corpora.


Distributed Deep Learning Using Volunteer Computing-Like Paradigm
  • Medha Atre, B. Jha, Ashwini Rao
  • Computer Science
    2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2021
This work designs a distributed solution that can run DL training on a volunteer computing (VC) system using a data-parallel approach, and implements VC-ASGD, a novel asynchronous SGD scheme suited for VC systems, which lowers costs by 70–90% and improves data security.
Multi-node Bert-pretraining: Cost-efficient Approach
This work demonstrates that the BERT model can be pre-trained within 2 weeks on an academic-size cluster of widely available GPUs through careful algorithmic and software optimizations; it presents optimizations that improve single-device training throughput, distribute the training workload over multiple nodes and GPUs, and overcome the communication bottleneck introduced by the large data exchanges over the network.
A hybrid GPU cluster and volunteer computing platform for scalable deep learning
This work presents a hybrid cluster and volunteer computing platform that scales out GPU clusters into volunteer computing for distributed deep learning, and demonstrates efficient use of the hybrid platform, albeit with sub-linear speedup.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
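The core Downpour SGD idea — many model replicas asynchronously pulling parameters from and pushing gradients to a central parameter server, tolerating stale reads — can be sketched in a few lines. This is an illustrative simplification with made-up names (single parameter shard, two sequentially interleaved replicas standing in for true concurrency), not the paper's system:

```python
class ParameterServer:
    """Central store for model parameters (one shard for simplicity;
    the real system shards parameters across many server machines)."""
    def __init__(self, w, lr=0.05):
        self.w = w
        self.lr = lr

    def pull(self):
        return self.w

    def push(self, grad):
        # Apply the update immediately, without waiting for other replicas.
        self.w = self.w - self.lr * grad

def replica_step(server, grad_fn, fetch_every, step, cached_w):
    # A replica refreshes its local parameter copy only every
    # `fetch_every` steps, so it often computes gradients on stale weights.
    if step % fetch_every == 0:
        cached_w = server.pull()
    server.push(grad_fn(cached_w))
    return cached_w

# Toy objective (w - 3)^2, so grad(w) = 2 * (w - 3).
server = ParameterServer(w=0.0, lr=0.05)
grad = lambda w: 2.0 * (w - 3.0)
caches = [0.0, 0.0]
for step in range(200):
    # Two replicas interleaved; the second tolerates 4-step staleness.
    caches[0] = replica_step(server, grad, 1, step, caches[0])
    caches[1] = replica_step(server, grad, 4, step, caches[1])
# server.w approaches 3.0 despite the stale gradients
```

With a small enough step size, the bounded staleness only slows convergence rather than preventing it, which is the property that makes the fully asynchronous design viable.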
Decentralized Deep Learning with Arbitrary Communication Compression
Using communication compression in the decentralized training setting achieves linear speedup in the number of workers and supports higher compression ratios than previous state-of-the-art methods.
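One standard compression building block compatible with such schemes is top-k sparsification with error feedback. The sketch below is a generic illustration under that assumption, not the paper's specific decentralized algorithm, and the toy objective is made up:

```python
import numpy as np

def top_k(v, k):
    # Compression operator: keep the k largest-magnitude entries, zero the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compressed_sgd_step(w, grad_fn, residual, lr=0.2, k=1):
    # Error feedback: fold the previously un-transmitted residual back into
    # the gradient before compressing, so dropped mass is not lost forever.
    g = grad_fn(w) + residual
    g_sent = top_k(g, k)
    return w - lr * g_sent, g - g_sent

# Toy quadratic: minimize ||w||^2 / 2, so grad(w) = w.
w, residual = np.array([4.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, residual = compressed_sgd_step(w, lambda v: v, residual)
# w approaches the minimizer even though only one coordinate is sent per step
```

The error-feedback residual is what allows aggressive compression: information suppressed in one round is re-injected in later rounds instead of being discarded.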
Stochastic Gradient Push for Distributed Deep Learning
Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.
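The push-sum gossip protocol underlying SGP can be illustrated with a pure averaging example (no gradient steps, and uniform mixing weights — both simplifying assumptions for the sketch):

```python
import numpy as np

def push_sum_round(x, w, out_neighbors):
    # One gossip round: each node splits its value x[i] and weight w[i]
    # uniformly among itself and its out-neighbors. On a strongly connected
    # directed graph, the ratios x[i] / w[i] converge to the global average.
    n = len(x)
    new_x, new_w = np.zeros(n), np.zeros(n)
    for i in range(n):
        targets = [i] + out_neighbors[i]
        share = 1.0 / len(targets)
        for j in targets:
            new_x[j] += share * x[i]
            new_w[j] += share * w[i]
    return new_x, new_w

# Directed 3-node ring: no symmetric links, yet push-sum still averages.
x, w = np.array([1.0, 2.0, 3.0]), np.ones(3)
out_neighbors = [[1], [2], [0]]
for _ in range(60):
    x, w = push_sum_round(x, w, out_neighbors)
# every ratio x[i] / w[i] approaches the average 2.0
```

Tracking the weight vector w alongside x is what lets push-sum average correctly over directed, non-symmetric communication graphs, where plain gossip averaging would be biased.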
Project Adam: Building an Efficient and Scalable Deep Learning Training System
This work describes the design and implementation of a distributed system called Adam, composed of commodity server machines, for training large deep neural network models; Adam exhibits world-class performance, scaling, and task accuracy on visual recognition tasks and shows that task accuracy improves with larger models.
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
  • S. Shi, Xiaowen Chu
  • Computer Science
    2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech)
  • 2018
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (Caffe-MPI, CNTK, MXNet, and TensorFlow) in single-GPU, multi-GPU, and multi-node environments, and identifies bottlenecks and overheads that could be further optimized.
SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Inspired by the BMUF method, this work proposes the slow momentum (SlowMo) framework, in which workers periodically synchronize and perform a momentum update after multiple iterations of a base optimization algorithm; it also provides theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses.
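A minimal sketch of that outer loop — a few base-optimizer steps per worker, synchronization by exact averaging, then the slow momentum update — is shown below. The hyperparameter values, toy quadratic objectives, and helper names are illustrative assumptions, not the paper's configuration:

```python
def inner_sgd(x, grad_fn, lr, steps):
    # Base optimizer: each worker takes a few plain SGD steps locally.
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x

def slowmo_step(x, u, worker_grads, lr=0.05, steps=4, alpha=1.0, beta=0.7):
    # One outer iteration: run the base optimizer on every worker,
    # synchronize by exact averaging, then apply the slow momentum update.
    locals_ = [inner_sgd(x, g, lr, steps) for g in worker_grads]
    x_avg = sum(locals_) / len(locals_)
    u = beta * u + (x - x_avg) / lr   # slow momentum buffer
    return x - alpha * lr * u, u      # slow outer step

# Two workers with different quadratics; the averaged optimum is x = 0.
grads = [lambda x: 2.0 * (x - 1.0), lambda x: 2.0 * (x + 1.0)]
x, u = 5.0, 0.0
for _ in range(100):
    x, u = slowmo_step(x, u, grads)
# x approaches 0, the minimizer of the averaged objective
```

Because momentum is applied only at the infrequent synchronization points, the extra communication cost over periodic model averaging is negligible.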
Communication-Efficient Learning of Deep Networks from Decentralized Data
This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
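The iterative model averaging at the heart of this method (FedAvg) can be sketched with a scalar toy problem; the client data, learning rate, and uniform averaging weights below are illustrative assumptions:

```python
def local_sgd(w, data, lr=0.1, steps=5):
    # Each client runs a few SGD steps on its own data; the toy objective
    # per client is (x * w - y)^2 with a scalar weight w.
    x, y = data
    for _ in range(steps):
        w = w - lr * 2.0 * x * (x * w - y)
    return w

def fedavg_round(w_global, clients):
    # Iterative model averaging: every client starts from the current
    # global model, trains locally, and the server averages the results
    # (uniform weights here; the method weights by client dataset size).
    local_models = [local_sgd(w_global, d) for d in clients]
    return sum(local_models) / len(local_models)

# Two hypothetical clients whose local optima are w = 2 and w = 4.
clients = [(1.0, 2.0), (1.0, 4.0)]
w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients)
# w converges to 3.0, between the two client optima
```

Running several local steps between synchronizations is what makes the approach communication-efficient: clients exchange models once per round rather than gradients once per step.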