BAGUA: Scaling up Distributed Learning with System Relaxations

@article{Gan2021BAGUASU,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations},
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hong-fan Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.01499}
}
Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via “system relaxations”: quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic… 
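To make the abstract's notion of a "system relaxation" concrete, below is a minimal sketch (not BAGUA's actual implementation) of one such relaxation, lossy gradient quantization: each worker transmits only the sign of every gradient entry plus a single scale factor, shrinking the payload from 32 bits to roughly 1 bit per value.

```python
# Illustrative sign-based gradient quantization; not BAGUA's code.
import numpy as np

def quantize_sign(grad: np.ndarray):
    """Compress a gradient to (scale, sign bits)."""
    scale = np.abs(grad).mean()          # one float for the whole tensor
    signs = np.signbit(grad)             # one bit per entry
    return scale, signs

def dequantize_sign(scale: float, signs: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate gradient on the receiving side."""
    return np.where(signs, -scale, scale)

if __name__ == "__main__":
    g = np.random.randn(8)
    scale, signs = quantize_sign(g)
    print(g)
    print(dequantize_sign(scale, signs))
```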
BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning
TLDR
BlueFog, a python library for straightforward, high-performance implementations of diverse decentralized algorithms, is introduced, based on a unified abstraction of various communication operations, which offers intuitive interfaces to implement a spectrum of decentralized algorithms.
DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
TLDR
The overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations are described.
Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
TLDR
A novel hybrid training algorithm is designed, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; then a system called Persia is built to support this hybrid training algorithm.
Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines
TLDR
A novel algorithmic framework that computes Shapley values of training examples over an end-to-end ML pipeline that is up to four orders of magnitude faster over state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.
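For context on the Monte Carlo baseline this work is compared against, here is a hedged sketch of permutation-sampling data Shapley estimation: each training example's importance is its average marginal effect on validation utility across random orderings. The 1-nearest-neighbor "utility" below is an illustrative choice, not the paper's pipeline-aware method.

```python
# Monte Carlo (permutation-sampling) estimate of data Shapley values; a
# generic baseline sketch, not the paper's end-to-end pipeline algorithm.
import numpy as np

def utility(train_X, train_y, val_X, val_y):
    """Validation accuracy of a 1-nearest-neighbor model (a cheap stand-in)."""
    if len(train_X) == 0:
        return 0.0
    d = ((val_X[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    pred = train_y[d.argmin(axis=1)]
    return float((pred == val_y).mean())

def monte_carlo_shapley(train_X, train_y, val_X, val_y, n_perms=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(train_X)
    phi = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = utility(train_X[:0], train_y[:0], val_X, val_y)
        for j, i in enumerate(perm):
            subset = perm[: j + 1]
            cur = utility(train_X[subset], train_y[subset], val_X, val_y)
            phi[i] += cur - prev              # marginal contribution of example i
            prev = cur
    return phi / n_perms
```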
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
TLDR
A novel hybrid training algorithm is designed, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; then a system called Persia is built to support this hybrid training algorithm.
DDA3C: Cooperative Distributed Deep Reinforcement Learning in A Group-Agent System
TLDR
This work proposes a distributed deep reinforcement learning algorithm called DDA3C (Decentralised Distributed Asynchronous Advantage Actor-Critic), the first framework designed for group-agent reinforcement learning, and shows that it achieves desirable performance and good scalability in the CartPole-v0 game environment.

References

SHOWING 1-10 OF 89 REFERENCES
Distributed Learning Systems with First-Order Methods
TLDR
A brief introduction of some distributed learning techniques that have recently been developed, namely lossy communication compression (e.g., quantization and sparsification), asynchronous communication, and decentralized communication are provided.
Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training
TLDR
This paper carefully analyzes the AllReduce based setup, proposes timing models which include network latency, bandwidth, cluster size and compute time, and demonstrates that a pipelined training with a width of two combines the best of both synchronous and asynchronous training.
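The kind of timing analysis mentioned here can be illustrated with the standard ring AllReduce cost model plus a pipelined-vs-sequential per-iteration comparison; this is a rough sketch, not Pipe-SGD's exact model.

```python
# Rough per-iteration timing sketch: ring AllReduce cost plus compute,
# sequential vs. pipelined (width two). Not Pipe-SGD's exact model.
def ring_allreduce_time(n_bytes, p, bandwidth, latency):
    """Standard ring AllReduce cost for p workers: 2(p-1) steps, each paying
    one latency and moving n_bytes / p of data."""
    return 2 * (p - 1) * (latency + n_bytes / (p * bandwidth))

def iteration_time(compute, comm, pipelined=False):
    """Sequential training pays compute + comm per step; a pipeline of width
    two overlaps them, so the slower stage dominates."""
    return max(compute, comm) if pipelined else compute + comm

if __name__ == "__main__":
    comm = ring_allreduce_time(n_bytes=100e6, p=16,
                               bandwidth=10e9 / 8, latency=50e-6)
    print("sequential:", iteration_time(0.1, comm))
    print("pipelined :", iteration_time(0.1, comm, pipelined=True))
```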
Large Scale Distributed Deep Networks
TLDR
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Heterogeneity-aware Distributed Parameter Servers
TLDR
A heterogeneity-aware algorithm that applies a constant learning rate schedule to updates before adding them to the global parameter is proposed, which suppresses stragglers' harm to robust convergence, and the valid convergence of both approaches is theoretically proven.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
TLDR
GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and it is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Communication Compression for Decentralized Training
TLDR
This paper develops a framework for quantized, decentralized training and proposes two strategies, extrapolation compression and difference compression, which significantly outperform the best of merely decentralized and merely quantized algorithms for networks with high latency and low bandwidth.
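The "difference compression" idea can be sketched roughly as follows: instead of sending its full model, each node sends a quantized difference from the copy its neighbors already hold, so compression error does not accumulate in the model itself. This is illustrative only, not the paper's exact update rule; the uniform quantizer and ring topology are assumptions.

```python
# Hedged sketch of difference compression in decentralized SGD.
import numpy as np

def quantize(v, levels=16):
    """Simple uniform quantizer, a stand-in for any compressor."""
    scale = np.abs(v).max() + 1e-12
    return np.round(v / scale * levels) / levels * scale

def decentralized_step(x, replicas, W, grads, lr=0.1):
    """One round on n nodes.
    x: (n, d) local models; replicas: (n, d) copies known to neighbors;
    W: (n, n) doubly stochastic mixing matrix; grads: (n, d)."""
    diffs = np.stack([quantize(x[i] - replicas[i]) for i in range(len(x))])
    replicas = replicas + diffs           # everyone updates the shared copies
    x = W @ replicas - lr * grads         # gossip average + local SGD step
    return x, replicas

if __name__ == "__main__":
    n, d = 4, 6
    rng = np.random.default_rng(0)
    W = np.zeros((n, n))
    for i in range(n):                    # ring mixing matrix
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    x, replicas = rng.standard_normal((n, d)), np.zeros((n, d))
    x, replicas = decentralized_step(x, replicas, W, rng.standard_normal((n, d)))
    print(x)
```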
High-Performance Distributed ML at Scale through Parameter Server Consistency Models
TLDR
This work studies both the theoretical guarantees and empirical behavior of iterative-convergent ML algorithms in existing PS consistency models, and uses the gleaned insights to improve a consistency model with an "eager" PS communication mechanism, implemented as a new PS system that enables ML algorithms to reach their solution more quickly.
Angel: a new large-scale machine learning system
TLDR
A new system, named Angel, is presented to facilitate the development of large-scale ML applications in a production environment; it reduces network latency by overlapping parameter pulling with update computation and exploits data sparseness to avoid pulling unnecessary parameters.
Sparsified SGD with Memory
TLDR
This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
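The error-compensation ("memory") scheme analyzed here can be sketched in a few lines: only the k largest-magnitude entries of the corrected gradient are transmitted, and the dropped mass is carried over to the next step. A minimal illustration, not the paper's exact implementation:

```python
# Top-k sparsified SGD with error compensation (residual "memory").
import numpy as np

def topk_with_memory(grad, memory, k):
    """Return the sparse update to send and the updated residual memory."""
    corrected = grad + memory                         # add back dropped mass
    idx = np.argpartition(np.abs(corrected), -k)[-k:]
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]                      # keep only top-k entries
    return sparse, corrected - sparse                 # new residual

if __name__ == "__main__":
    mem = np.zeros(10)
    for _ in range(3):
        g = np.random.randn(10)
        update, mem = topk_with_memory(g, mem, k=2)
        print(update)
```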
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
TLDR
This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work.
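The HOGWILD! idea can be illustrated as follows (an illustrative Python sketch, not the paper's original implementation): several threads apply sparse SGD updates to one shared weight vector with no locking, relying on sparsity to keep write conflicts rare.

```python
# Lock-free, HOGWILD!-style SGD sketch on synthetic sparse hinge-loss data.
import numpy as np
import threading

d, lr = 1000, 0.1
w = np.zeros(d)                                       # shared, unprotected

def worker(seed, steps=1000):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(d, size=5, replace=False)    # sparse example support
        x = rng.standard_normal(5)
        y = 1.0 if x.sum() > 0 else -1.0
        margin = y * float(w[idx] @ x)
        if margin < 1.0:                              # hinge-loss subgradient
            w[idx] += lr * y * x                      # lock-free write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("trained weight norm:", np.linalg.norm(w))
```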