• Corpus ID: 231855626

Consensus Control for Decentralized Deep Learning

Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by… 
DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training
  • K. Yuan, Yiming Chen, W. Yin
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
This work finds the momentum term can amplify the inconsistency bias in DmSGD and proposes DecentLaM, a novel decentralized large-batch momentum SGD to remove the momentum-incurred bias.
Decentralized Local Stochastic Extra-Gradient for Variational Inequalities
A novel method, based on stochastic extragradient, in which participating devices can communicate over arbitrary, possibly time-varying network topologies and the problem data is distributed across many participating devices (heterogeneous, or non-IID, data setting).
D-Cliques: Compensating NonIIDness in Decentralized Federated Learning with Topology
D-Cliques is presented, a novel topology that reduces gradient bias by grouping nodes in interconnected cliques such that the local joint distribution in a clique is representative of the global class distribution.
Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
This paper investigates and identifies the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity, and proposes a novel momentum-based method to mitigate this decentralized training difficulty.
Yes, Topology Matters in Decentralized Optimization: Refined Convergence and Topology Learning under Heterogeneous Data
This paper revisits the analysis of the Decentralized Stochastic Gradient Descent algorithm (D-SGD), a popular decentralized learning algorithm, under data heterogeneity and argues that neighborhood heterogeneity provides a natural criterion to learn sparse data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental impact of data heterogeneity on the convergence time of D-SGD.
Exponential Graph is Provably Efficient for Decentralized Deep Training
This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors and n is the total number of nodes, can lead to both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
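The exact-averaging claim in the summary above can be checked on a small case. The following is a minimal sketch, not the paper's implementation: it assumes n is a power of two, that in round k each node averages with the single peer 2^k positions ahead, and that both values get weight 1/2. The function names are hypothetical and chosen for illustration only.

```python
from fractions import Fraction


def one_peer_exponential_step(values, k):
    """One gossip round: node i averages its value with node (i + 2**k) mod n.

    Assumption: uniform weights of 1/2 on the node's own value and its peer's.
    """
    n = len(values)
    hop = 2 ** k
    return [(values[i] + values[(i + hop) % n]) / 2 for i in range(n)]


def gossip_over_one_peer_graphs(values):
    """Apply log2(n) one-peer rounds with hops 1, 2, 4, ... in sequence."""
    n = len(values)
    rounds = n.bit_length() - 1
    if 2 ** rounds != n:
        raise ValueError("this sketch assumes n is a power of two")
    for k in range(rounds):
        values = one_peer_exponential_step(values, k)
    return values


# Eight nodes, each holding one scalar; Fractions keep the arithmetic exact.
start = [Fraction(v) for v in [3, 1, 4, 1, 5, 9, 2, 6]]
result = gossip_over_one_peer_graphs(start)
mean = sum(start) / len(start)
print(result == [mean] * len(start))
```

Each round doubles the number of consecutive starting values that every node has averaged, so after log2(n) rounds every node holds the global mean while having contacted only one peer per round.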
D-Cliques: Compensating for Data Heterogeneity with Topology in Decentralized Federated Learning
D-Cliques is presented, a novel topology that reduces gradient bias by grouping nodes in sparsely interconnected cliques such that the label distribution in a clique is representative of the global label distribution.
Topology-aware Generalization of Decentralized SGD
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N+1/m+λ)-stable in
Theoretical Analysis of Primal-Dual Algorithm for Non-Convex Stochastic Decentralized Optimization
The Generalized ECL (G-ECL) is proposed, which contains the ECL as a special case, and convergence rates of the G-ECL are provided in both (strongly) convex and non-convex settings; these rates do not depend on the heterogeneity of data distributions.
Semi-Decentralized Federated Learning with Collaborative Relaying
A semi-decentralized federated learning algorithm wherein clients collaborate by relaying their neighbors’ local updates to a central parameter server (PS) shows an improved convergence rate and accuracy in comparison with the federated averaging algorithm.


Decentralized Deep Learning with Arbitrary Communication Compression
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
This paper studies a D-PSGD algorithm and provides the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.
Extrapolation for Large-batch Training in Deep Learning
This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima, and proves the convergence of this novel scheme and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer.
Decentralized gradient methods: does topology matter?
It is shown how sparse topologies can lead to faster convergence even in the absence of communication delays, and theoretical results suggest that worker communication topology should have strong impact on the number of epochs needed to converge.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.
Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks
The efficiency of MSDA is verified against state-of-the-art methods on two problems: least-squares regression and classification by logistic regression.
Optimal Algorithms for Non-Smooth Distributed Optimization in Networks
The error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions, and the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate are provided.
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
Asynchronous Decentralized Parallel Stochastic Gradient Descent
This paper proposes an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all above expectations and is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.