• Corpus ID: 235485466

Distributed Deep Learning in Open Collaborations

@article{Diskin2021DistributedDL,
  title={Distributed Deep Learning in Open Collaborations},
  author={Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.10207}
}
Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups… 

Citations

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

  • Computer Science
  • 2021
This work proposes SWARM Parallelism, a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous, and unreliable devices; it creates temporary randomized pipelines between available nodes and rebalances them in case of failure.
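
As a rough illustration of the pipeline idea summarized above, the following Python sketch (illustrative names only; Node, build_random_pipeline, and the stage layout are assumptions, not the paper's API) builds a temporary pipeline by sampling one live node per stage, so a failed node is simply skipped the next time a pipeline is assembled.

    import random
    from dataclasses import dataclass

    @dataclass
    class Node:
        """Illustrative stand-in for a worker that serves one pipeline stage."""
        name: str
        stage: int          # which model stage this worker can run
        alive: bool = True  # flips to False when the worker fails or leaves

    def build_random_pipeline(nodes, num_stages):
        """Assemble a temporary pipeline by picking one live node per stage at random."""
        pipeline = []
        for stage in range(num_stages):
            candidates = [n for n in nodes if n.alive and n.stage == stage]
            if not candidates:
                raise RuntimeError(f"no live node can serve stage {stage}")
            pipeline.append(random.choice(candidates))
        return pipeline

    # Rebuilding the pipeline every step means a failure only costs one retry.
    nodes = [Node(f"gpu{i}", stage=i % 4) for i in range(16)]
    nodes[5].alive = False  # simulate a dropped participant
    print([n.name for n in build_random_pipeline(nodes, num_stages=4)])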

Decentralized Training of Foundation Models in Heterogeneous Environments

This paper presents the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network, and provides a formal cost model and an efficient evolutionary algorithm to find the optimal allocation strategy.

Training Transformers Together

This paper describes the collaborative training of a text-to-image transformer similar to OpenAI DALL-E that generates images of reasonable quality for a number of prompts, and explains how to address the engineering challenges associated with such a training run.

Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top

Theoretical convergence guarantees are derived for Byz-VR-MARINA that improve on the previous state of the art for general non-convex and Polyak-Łojasiewicz loss functions, along with the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients.

Datasets: A Community Library for Natural Language Processing

After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks.

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

This work introduces BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature, and achieves state-of-the-art results, outperforming multilingual and monolingual models.

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

An overview of the current state of NLP research for Indonesia's 700+ languages is given, along with general recommendations to help develop NLP technology not only for the languages of Indonesia but also for other underrepresented languages.

References

Showing 1-10 of 134 references

Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

This work proposes Learning@home, a novel neural network training paradigm designed to handle large numbers of poorly connected participants, and analyzes the performance, reliability, and architectural constraints of this paradigm, comparing it against existing distributed training techniques.

Distributed Deep Learning Using Volunteer Computing-Like Paradigm

  • Medha Atre, B. Jha, Ashwini Rao
  • Computer Science
    2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2021
This work designs a distributed solution that can run DL training on a volunteer computing (VC) system using a data-parallel approach and implements a novel asynchronous SGD scheme, VC-ASGD, suited for VC systems, which lowers cost by 70-90% and improves data security.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
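
A minimal single-process simulation of the asynchronous data-parallel pattern behind Downpour SGD (a sketch under simplifying assumptions, not the original multi-machine implementation): workers holding different data shards read the shared parameters, compute a local gradient, and apply their update without synchronizing with each other.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linear-regression task, split into data-parallel shards (one per worker).
    X, true_w = rng.normal(size=(1000, 10)), rng.normal(size=10)
    y = X @ true_w
    shards = np.array_split(np.arange(len(X)), 4)

    params = np.zeros(10)  # stands in for the shared parameter server state
    lr = 0.05

    for step in range(300):
        worker = rng.integers(len(shards))         # workers act in no fixed order
        idx = rng.choice(shards[worker], size=32)  # mini-batch from the worker's shard
        local = params.copy()                      # fetch current parameters
        grad = 2 * X[idx].T @ (X[idx] @ local - y[idx]) / len(idx)
        params -= lr * grad                        # push the update back immediately

    print("distance to the true weights:", np.linalg.norm(params - true_w))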

Multi-node Bert-pretraining: Cost-efficient Approach

This work demonstrates that the BERT pre-trained model can be trained within 2 weeks on an academic-size cluster of widely available GPUs through careful algorithmic and software optimizations, and presents optimizations that improve single-device training throughput, distribute the training workload over multiple nodes and GPUs, and overcome the communication bottleneck introduced by the large data exchanges over the network.

A hybrid GPU cluster and volunteer computing platform for scalable deep learning

This work presents a hybrid cluster and volunteer computing platform that scales out GPU clusters into volunteer computing for distributed deep learning, and shows that the hybrid platform can be used efficiently, albeit with sub-linear speedup.

Decentralized Deep Learning with Arbitrary Communication Compression

The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
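
For context, a generic top-k gradient sparsifier with error feedback, one common form of communication compression in this line of work (a hedged sketch; the exact compressor and decentralized averaging scheme of the paper are not reproduced here):

    import numpy as np

    def topk_compress(grad, k):
        """Keep only the k largest-magnitude entries; the rest are not transmitted."""
        sparse = np.zeros_like(grad)
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        sparse[idx] = grad[idx]
        return sparse

    # Error feedback re-injects whatever was dropped into the next step's gradient,
    # which is what keeps aggressive compression from stalling training.
    rng = np.random.default_rng(0)
    error = np.zeros(1000)
    for step in range(5):
        grad = rng.normal(size=1000)
        compensated = grad + error
        sent = topk_compress(compensated, k=10)   # ~1% of entries are communicated
        error = compensated - sent
        print(f"step {step}: sent {np.count_nonzero(sent)} of {sent.size} values")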

Stochastic Gradient Push for Distributed Deep Learning

Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD and that all nodes achieve consensus.
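
The consensus component of SGP is the PushSum gossip protocol; the toy simulation below (averaging only, with the SGD updates omitted) shows each node's ratio of pushed values to pushed weights converging to the global average over a directed ring.

    import numpy as np

    n = 8
    rng = np.random.default_rng(0)
    x = rng.normal(size=n)   # each node starts with a local value
    w = np.ones(n)           # PushSum weights
    true_mean = x.mean()

    # Directed ring: every node keeps half of (x, w) and pushes half to its successor.
    P = 0.5 * np.eye(n)
    P[(np.arange(n) + 1) % n, np.arange(n)] += 0.5  # column-stochastic mixing matrix

    for _ in range(200):
        x, w = P @ x, P @ w  # one gossip round

    print("node-0 estimate:", x[0] / w[0], "true average:", true_mean)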

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

This work proposes Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average, and demonstrates the efficiency of this protocol for distributed optimization with strong theoretical guarantees.
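
A toy, failure-free illustration of iterative group averaging in the spirit of Moshpit All-Reduce (the real protocol handles dynamic and unreliable groups): peers are placed on a virtual 2D grid and average within one grid dimension per round, which already yields the exact global average after one pass over both dimensions.

    import numpy as np

    rng = np.random.default_rng(0)
    peers = rng.normal(size=(4, 4))  # 16 peers on a 4x4 virtual grid, one scalar each
    global_mean = peers.mean()

    # Round 1: peers that share a column average among themselves.
    peers = np.broadcast_to(peers.mean(axis=0, keepdims=True), peers.shape).copy()
    # Round 2: peers that share a row average among themselves.
    peers = np.broadcast_to(peers.mean(axis=1, keepdims=True), peers.shape).copy()

    print(np.allclose(peers, global_mean))  # True: every peer now holds the global average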

Project Adam: Building an Efficient and Scalable Deep Learning Training System

This paper presents the design and implementation of a distributed system called Adam, comprised of commodity server machines, for training large deep neural network models; the system exhibits world-class performance, scaling, and task accuracy on visual recognition tasks and shows that task accuracy improves with larger models.

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

  • S. Shi, Xiaowen Chu
  • Computer Science
    2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)
  • 2018
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments and identifies bottlenecks and overheads which could be further optimized.
...