Corpus ID: 243847842

BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning

@article{Ying2021BlueFogMD,
  title={BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning},
  author={Bicheng Ying and Kun Yuan and Hanbin Hu and Yiming Chen and Wotao Yin},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.04287}
}
A decentralized algorithm is a form of computation that achieves a global goal through local dynamics relying on low-cost communication between directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, and sometimes superior, performance compared with distributed algorithms that rely on a central node. Recently, developing decentralized algorithms for deep learning has attracted great attention. They are considered low-communication… 
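To make the neighbor-communication pattern described above concrete, here is a minimal NumPy sketch of decentralized gradient descent over a ring topology (the quadratic local losses, the ring mixing matrix, and the step size are illustrative assumptions; this is a hand-rolled simulation, not BlueFog's own API):

import numpy as np

# Minimal simulation of decentralized gradient descent (the D-PSGD pattern with
# deterministic gradients).  Each of n agents holds a private target b[i] and
# minimizes f_i(x) = 0.5*(x - b[i])**2; the global minimizer is b.mean().
n, steps, lr = 8, 2000, 0.02
rng = np.random.default_rng(0)
b = rng.normal(size=n)               # private local data, one target per agent
x = np.zeros(n)                      # each agent's local model (a scalar here)

# Doubly stochastic mixing matrix for a ring: average with the two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3

for _ in range(steps):
    grad = x - b                     # each agent's local gradient
    x = W @ x - lr * grad            # combine neighbors' models, then take a local step

print(x, b.mean())                   # entries sit near b.mean(); the constant step size
                                     # leaves a small bias that diminishing steps remove

In a library such as BlueFog, the combine step would be performed by a neighbor-averaging communication primitive over the chosen virtual topology rather than by an explicit mixing matrix.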
Exponential Graph is Provably Efficient for Decentralized Deep Training
TLDR
This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors (n being the total number of nodes), can lead to both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
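The exact-averaging claim is easy to check numerically; the sketch below is a small NumPy verification that assumes n is a power of two and uses ring-style indexing for the 2**k-hop neighbor (both are illustrative choices):

import numpy as np

# One-peer exponential graphs: in round k, each node averages only with the
# node 2**k hops away.  For n = 2**tau nodes, applying the tau graphs in
# sequence reproduces the exact global average.
n = 16
tau = int(np.log2(n))
x = np.random.default_rng(1).normal(size=n)
target = x.mean()

for k in range(tau):
    x = 0.5 * (x + np.roll(x, -2**k))    # pairwise average with the 2**k-th neighbor

print(np.allclose(x, target))            # True: exact averaging after log2(n) rounds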
Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD
TLDR
By eliminating the influence of data heterogeneity between nodes, D²/Exact-diffusion is shown to have an enhanced transient stage on the order of Ω̃(n/(1−β)) and Ω(n/β) for strongly convex and generally convex cost functions, respectively, which is, to the authors' knowledge, the best (i.e., weakest) dependence on network topology among existing decentralized algorithms.
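For orientation, here is a minimal deterministic-gradient sketch of the adapt-correct-combine recursion behind Exact-diffusion; the quadratic losses, ring topology, and step size are illustrative assumptions and do not reflect the stochastic setting analyzed in the paper:

import numpy as np

# Exact-diffusion: adapt (local gradient step), correct (undo the previous
# adaptation), combine (average with neighbors using (W + I)/2).  Unlike plain
# decentralized gradient descent, it reaches the exact minimizer despite
# heterogeneous local losses f_i(x) = 0.5*||x - B[i]||^2.
n, d, steps, mu = 8, 3, 400, 0.1
rng = np.random.default_rng(2)
B = rng.normal(size=(n, d))            # heterogeneous local targets
W = np.zeros((n, n))                   # ring mixing matrix
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3
W_bar = (W + np.eye(n)) / 2

x = np.zeros((n, d))
psi_prev = x.copy()
for _ in range(steps):
    psi = x - mu * (x - B)             # adapt:   local gradient step
    phi = psi + x - psi_prev           # correct: remove the previous adaptation
    x = W_bar @ phi                    # combine: average with neighbors
    psi_prev = psi

print(np.abs(x - B.mean(axis=0)).max())   # tends to 0: exact consensus on the optimum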
Topology-aware Generalization of Decentralized SGD
TLDR
It is proved that the consensus model learned by D-SGD is O(m/N + 1/m + λ²)-stable in expectation in the non-convex non-smooth setting, which is non-vacuous even when λ is close to 1, in contrast to the vacuous bounds suggested by existing literature.
Heavy-Tail Phenomenon in Decentralized SGD
TLDR
The theory uncovers an interesting interplay between the tails and the network structure: two regimes of parameters (stepsize and network size) are identified, where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime.
On the Privacy of Decentralized Machine Learning
TLDR
It is demonstrated that, contrary to what is claimed by proponents of decentralized learning, decentralized learning does not offer any security advantages over more practical approaches such as federated learning, and it tends to degrade users' privacy by increasing the attack surface.

References

Showing 1-10 of 88 references
Exponential Graph is Provably Efficient for Decentralized Deep Training
TLDR
This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors (n being the total number of nodes), can lead to both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
TLDR
This paper studies the D-PSGD algorithm and provides the first theoretical analysis indicating a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
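The idea can be sketched as compressed gossip with locally tracked public copies; the scheme below simplifies the setting to pure averaging with no gradient steps, and the scaled-sign compressor, ring topology, and consensus step size are illustrative assumptions:

import numpy as np

# Compressed gossip sketch: nodes exchange compressed differences between their
# model and a publicly known copy, so each round transmits only signs plus one
# scalar per vector, yet the public copies track the true models over time.
n, d, steps, gamma = 8, 4, 5000, 0.02
rng = np.random.default_rng(3)
x = rng.normal(size=(n, d))             # local models; goal: consensus on x.mean(axis=0)
x_hat = np.zeros((n, d))                # public copies known to the neighbors

W = np.zeros((n, n))                    # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3

def compress(v):
    # Scaled-sign compressor: keeps only the signs plus one scalar per vector.
    return np.abs(v).mean() * np.sign(v)

target = x.mean(axis=0)
for _ in range(steps):
    q = np.array([compress(x[i] - x_hat[i]) for i in range(n)])  # what gets transmitted
    x_hat = x_hat + q                                            # everyone updates the public copies
    x = x + gamma * (W @ x_hat - x_hat)                          # consensus step on the public copies

print(np.abs(x - target).max())   # shrinks toward 0 for a sufficiently small gamma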
Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training
TLDR
Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach, is proposed; it achieves high performance under heterogeneity through intensive synchronization optimization, exploiting the interplay between algorithm and system implementation, i.e., between statistical and hardware efficiency.
RelaySum for Decentralized Deep Learning on Heterogeneous Data
TLDR
It is proved that RelaySGD, based on the RelaySum mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data.
Consensus Control for Decentralized Deep Learning
TLDR
It is shown in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as its centralized counterpart, and the accompanying empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.
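The consensus distance referred to here is the average squared deviation of the workers' models from their mean; the helper below (a hypothetical name, shown only to pin down the quantity) computes it:

import numpy as np

def consensus_distance(models):
    # models: array of shape (n_workers, n_params).
    # Returns (1/n) * sum_i ||x_i - x_bar||^2, the average squared distance of
    # each local model from the mean model.
    mean_model = models.mean(axis=0)
    return np.mean(np.sum((models - mean_model) ** 2, axis=1))

# Example: models that are nearly in consensus give a small value.
models = 1.0 + 0.01 * np.random.default_rng(4).normal(size=(8, 10))
print(consensus_distance(models))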
On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization
TLDR
This paper considers a distributed communication-efficient momentum SGD method and proves its linear speedup property, filling a gap in the study of distributed SGD variants with reduced communication.
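A common instance of communication-efficient momentum SGD is local momentum SGD with periodic model averaging; the sketch below is an illustrative rendering of that idea, and whether the momentum buffers are also averaged, as well as all hyperparameters, are assumptions rather than details taken from the paper:

import numpy as np

# Local momentum SGD with periodic averaging: each worker runs H momentum-SGD
# steps on its own data, then all workers average their models, cutting
# communication by a factor of H compared with averaging at every step.
n, d, H, rounds = 4, 5, 10, 50
lr, beta = 0.05, 0.9
rng = np.random.default_rng(5)
B = rng.normal(size=(n, d))              # worker-specific data (targets)
x = np.zeros((n, d))                     # local models
m = np.zeros((n, d))                     # local momentum buffers

for _ in range(rounds):
    for _ in range(H):                   # local phase: no communication
        grad = x - B                     # gradient of 0.5*||x_i - B[i]||^2
        m = beta * m + grad
        x = x - lr * m
    x[:] = x.mean(axis=0)                # communication phase: average the models
    m[:] = m.mean(axis=0)                # (averaging the buffers too is one design choice)

print(np.abs(x - B.mean(axis=0)).max())  # small: workers agree near the global optimum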
Stochastic Gradient Push for Distributed Deep Learning
TLDR
Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD and that all nodes achieve consensus.
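The de-biasing trick at the heart of SGP is push-sum over a directed graph; the sketch below isolates that mechanism on an illustrative asymmetric directed ring and omits the gradient steps that SGP would interleave between pushes:

import numpy as np

# Push-sum: on a directed graph the mixing matrix is column-stochastic but not
# doubly stochastic, so plain averaging is biased; tracking an extra weight w
# and de-biasing with z = x / w recovers the exact average at every node.
n, steps = 8, 1000
rng = np.random.default_rng(6)
x = rng.normal(size=n)                 # each node's value
w = np.ones(n)                         # push-sum weights
target = x.mean()

P = np.zeros((n, n))
for i in range(n):
    outs = [i, (i + 1) % n]            # every node pushes to itself and its successor...
    if i == 0:
        outs.append(2)                 # ...and node 0 also pushes to node 2 (asymmetric graph)
    for j in outs:
        P[j, i] = 1.0 / len(outs)      # columns sum to 1; rows do not

for _ in range(steps):
    x = P @ x                          # push values along the directed edges
    w = P @ w                          # push the weights the same way
z = x / w                              # de-biased estimate at each node

print(np.allclose(z, target))          # True: every node recovers the exact average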
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
TLDR
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods that have so far required different intuitions, have different applications, and have been developed separately in various communities.
BAGUA: Scaling up Distributed Learning with System Relaxations
TLDR
BAGUA is a communication framework whose design goal is to provide a system abstraction that is both flexible and modular, in order to support state-of-the-art system relaxation techniques for distributed training.
...