Topology-aware Generalization of Decentralized SGD

Tongtian Zhu, Fengxiang He, Lance Zhang, Zhengyang Niu, Mingli Song and Dacheng Tao
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N + 1/m + λ^2)-stable in expectation in the non-convex non-smooth setting, where N is the total sample size of the whole system, m is the number of workers, and 1 − λ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an O(1/N + ((m^{-1}λ^2)^{α/2} + m^{-α})/N^{1…
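
The role of the spectral gap 1 − λ can be made concrete: λ is the second-largest eigenvalue magnitude of the doubly stochastic mixing matrix of the communication topology, so better-connected topologies have larger gaps. A minimal sketch (the ring with uniform 1/3 weights is an illustrative choice, not a setup taken from the paper):

```python
import numpy as np

def spectral_gap(W):
    """1 - lambda, where lambda is the second-largest eigenvalue magnitude of W."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - magnitudes[1]

def ring_matrix(m):
    """Doubly stochastic mixing matrix for a ring: average self and both neighbors."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1 / 3
    return W

m = 16
gap_ring = spectral_gap(ring_matrix(m))       # sparse ring: gap close to 0
gap_full = spectral_gap(np.ones((m, m)) / m)  # fully connected: gap exactly 1
print(gap_ring, gap_full)
```

The ring's gap shrinks toward 0 as m grows while the fully connected topology keeps a gap of 1, which is why the λ^2 term in the bound above is small for well-connected topologies.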


Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data

This paper revisits the analysis of the popular decentralized stochastic gradient descent algorithm (D-SGD) under data heterogeneity and argues that neighborhood heterogeneity provides a natural criterion for learning data-dependent topologies that reduce the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD.

Stability and Generalization of the Decentralized Stochastic Gradient Descent

Leveraging this formulation together with (non-)convex optimization theory, this paper establishes the first stability and generalization guarantees for decentralized stochastic gradient descent.

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

This paper introduces a new stability measure called on-average model stability and develops novel bounds for it that are controlled by the risks of the SGD iterates, giving the first known stability and generalization bounds for SGD with even non-differentiable loss functions.

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

This paper studies the D-PSGD algorithm and provides the first theoretical analysis identifying a regime in which decentralized algorithms can outperform centralized algorithms for distributed stochastic gradient descent.
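
A D-PSGD iteration combines a local stochastic gradient step with gossip averaging over the mixing matrix W. A minimal toy sketch, assuming hypothetical quadratic local losses f_i(x) = ½‖x − c_i‖², a ring topology, and a decaying step size (none of which are the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 5                              # workers, model dimension
C = rng.normal(size=(m, d))              # targets: f_i(x) = 0.5 * ||x - c_i||^2
X = rng.normal(size=(m, d))              # one local model per worker

# Doubly stochastic ring mixing matrix: average self and both neighbors
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1 / 3

for t in range(500):
    lr = 1.0 / (t + 2)                   # decaying step size
    grads = X - C                        # ∇f_i(x_i) for the quadratic losses
    X = W @ (X - lr * grads)             # local SGD step, then gossip averaging

# Workers approach consensus on the global minimizer, the mean of the c_i
print(np.abs(X - X.mean(axis=0)).max())              # small disagreement
print(np.linalg.norm(X.mean(axis=0) - C.mean(axis=0)))  # near the minimizer
```

With a constant step size the workers would retain a persistent consensus error; the decaying step size drives both the disagreement and the optimization error toward zero.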

Identity Mappings in Deep Residual Networks

The propagation formulations behind the residual building blocks suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation.
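
The propagation formulation can be made concrete: with identity skips x_{l+1} = x_l + F(x_l), the activation at any later block equals the earlier activation plus a sum of residuals, so signals propagate directly between blocks. A minimal numpy sketch (toy tanh residual branches, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 4, 3                               # number of residual blocks, feature dim
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]

def residual_branch(x, Wl):
    return np.tanh(Wl @ x)                # F(x): the residual branch

x0 = rng.normal(size=d)
x, residuals = x0, []
for Wl in weights:
    r = residual_branch(x, Wl)
    residuals.append(r)
    x = x + r                             # identity skip: x_{l+1} = x_l + F(x_l)

# Final activation = input + accumulated residuals, by construction
print(np.allclose(x, x0 + sum(residuals)))  # True
```

The same additive decomposition applies to gradients in the backward pass, which is the paper's argument for why identity mappings ease optimization.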

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable.

BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning

This paper introduces BlueFog, a Python library for straightforward, high-performance implementations of diverse decentralized algorithms. Built on a unified abstraction of various communication operations, BlueFog offers intuitive interfaces for implementing a spectrum of decentralized algorithms.

Exponential Graph is Provably Efficient for Decentralized Deep Training

This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors (n being the total number of nodes), can achieve both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
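
The exact-averaging claim is easy to verify numerically when n is a power of two: at step k each node averages with the peer 2^k hops away, and after log2(n) such one-peer rounds every node holds the exact global average. A minimal sketch (illustrative scalar states, not the paper's code):

```python
import numpy as np

n = 8                                    # number of nodes; must be a power of two
rng = np.random.default_rng(1)
x = rng.normal(size=n)                   # one scalar state per node
target = x.mean()

# One-peer exponential graphs: at step k, node i averages with node (i - 2^k) mod n
for k in range(int(np.log2(n))):
    x = 0.5 * (x + np.roll(x, 2 ** k))

print(np.allclose(x, target))            # True: exact consensus in log2(n) steps
```

Each node's final value averages contributions from all n offsets 0, …, n−1, since every residue is a sum of a distinct subset of {1, 2, 4, …, n/2}.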

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

This paper introduces a unified convergence analysis covering a large variety of decentralized SGD methods that have so far required different intuitions, served different applications, and been developed separately in various communities.

Stochastic Gradient Push for Distributed Deep Learning

This paper studies Stochastic Gradient Push (SGP), proving that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD and that all nodes achieve consensus.