Corpus ID: 238583300

An Empirical Study on Compressed Decentralized Stochastic Gradient Algorithms with Overparameterized Models

@article{Rao2021AnES,
  title={An Empirical Study on Compressed Decentralized Stochastic Gradient Algorithms with Overparameterized Models},
  author={Arjun Ashok Rao and Hoi-To Wai},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.04523}
}
  • Arjun Ashok Rao, Hoi-To Wai
  • Published 9 October 2021
  • Computer Science, Mathematics
  • ArXiv
This paper considers decentralized optimization with application to machine learning on graphs. The growing size of neural network (NN) models has motivated prior works on decentralized stochastic gradient algorithms to incorporate communication compression. On the other hand, recent works have demonstrated the favorable convergence and generalization properties of overparameterized NNs. In this work, we present an empirical analysis on the performance of compressed decentralized stochastic…
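To make the setting concrete, the sketch below simulates one round of decentralized SGD with communication compression on a ring graph. It is a minimal illustration only: the top_k compressor, the ring mixing matrix, and all function names are assumptions for exposition, not the algorithm or code studied in the paper. Practical schemes (such as CHOCO-SGD, referenced below) compress corrections rather than raw models to avoid the information loss this naive version incurs.

import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude entries of v and zero out the rest (a common biased compressor).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ring_mixing_matrix(n):
    # Doubly stochastic mixing matrix for a ring graph (n >= 3): each node averages with its two neighbours.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def compressed_dsgd_step(X, W, grads, lr, k):
    # One schematic round: nodes transmit compressed copies of their models (row i of X is node i's model),
    # average what they receive via the mixing matrix W, then take a local stochastic gradient step.
    X_tx = np.stack([top_k(x, k) for x in X])   # messages actually sent over the network
    return W @ X_tx - lr * grads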


References

Showing 1-10 of 32 references
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
TLDR: This paper studies the D-PSGD algorithm and provides the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.
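For reference, one common way to write the D-PSGD recursion is: each node averages its neighbours' full-precision models under the mixing matrix W and then takes a local stochastic gradient step. A minimal sketch, reusing the hypothetical ring_mixing_matrix and array shapes from the sketch above:

def dpsgd_step(X, W, grads, lr):
    # Schematic D-PSGD round: x_i <- sum_j W_ij x_j - lr * g_i, with full-precision communication.
    return W @ X - lr * grads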
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
TLDR: This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations, $\delta$ the eigengap of the connectivity matrix, and $\omega$ the compression ratio.
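The idea behind CHOCO-SGD-style methods is that nodes never exchange raw models: each node keeps a public estimate of its own model, transmits only a compressed correction toward its current iterate, and runs a gossip step on the public estimates. The sketch below is schematic, not the authors' implementation; it reuses the hypothetical top_k compressor and numpy import from above, and the step size lr and consensus rate gamma would need tuning.

def choco_like_step(X, X_hat, W, grads, lr, gamma, k):
    # X     : (n, d) private local models
    # X_hat : (n, d) publicly known estimates of those models, kept in sync across nodes
    n = len(X)
    X = X - lr * grads                                            # local stochastic gradient step
    Q = np.stack([top_k(x - xh, k) for x, xh in zip(X, X_hat)])   # compressed corrections sent out
    X_hat = X_hat + Q                                             # everyone updates the public estimates
    X = X + gamma * (W - np.eye(n)) @ X_hat                       # gossip toward the neighbours' estimates
    return X, X_hat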
A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!
TLDR: This work proposes a new randomized first-order method that tackles the communication bottleneck by applying randomized compression operators to the communicated messages, obtaining the first scheme that converges linearly on strongly convex decentralized problems while using only compressed communication.
Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: Joint Gradient Estimation and Tracking
TLDR: This work proposes an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (estimating the local gradient from a subset of local samples) and gradient tracking (tracking the global full gradient using local estimates).
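Gradient tracking, one of the two ingredients in D-GET, maintains an auxiliary variable at each node that tracks the network-average gradient by mixing it over the graph and adding the change in local gradients. The sketch below shows only this tracking recursion (DIGing-style, reusing the mixing-matrix convention from above); D-GET's variance-reduced local gradient estimation is omitted.

def gradient_tracking_step(X, Y, W, grads_prev, grad_fn, lr):
    # X          : (n, d) local models
    # Y          : (n, d) local trackers of the network-average gradient
    # grads_prev : (n, d) local gradients evaluated at the current X
    # grad_fn    : callable returning (n, d) local gradients at a given iterate
    X_next = W @ X - lr * Y                       # descend along the tracked direction
    grads_next = grad_fn(X_next)                  # fresh local gradients at the new iterates
    Y_next = W @ Y + grads_next - grads_prev      # tracking recursion
    return X_next, Y_next, grads_next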
Communication Compression for Decentralized Training
TLDR: This paper develops a framework for quantized, decentralized training and proposes two strategies, extrapolation compression and difference compression, which significantly outperform the best of the purely decentralized and purely quantized algorithms on networks with high latency and low bandwidth.
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR: The use of communication compression in the decentralized training context achieves a linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
The Convergence of Sparsified Gradient Methods
TLDR: It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for data-parallel SGD, for both convex and non-convex smooth objectives.
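The local error correction mentioned here has a very compact form: each worker adds the residual left over from the previous compression back onto its fresh gradient before sparsifying, so no coordinate is dropped forever. A minimal per-worker sketch, reusing the hypothetical top_k compressor from above:

def error_feedback_step(grad, memory, k):
    # grad   : (d,) fresh stochastic gradient
    # memory : (d,) residual left over from the previous round
    corrected = grad + memory          # re-inject what was dropped earlier
    message = top_k(corrected, k)      # transmit only the k largest-magnitude entries
    memory = corrected - message       # remember what was dropped this round
    return message, memory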
Understanding Top-k Sparsification in Distributed Deep Learning
TLDR: The gradient distribution is exploited to propose an approximate top-$k$ selection algorithm that is computation-efficient on GPUs, improving the scaling efficiency of TopK-SGD by significantly reducing the computing overhead.
signSGD: compressed optimisation for non-convex problems
TLDR: SignSGD can get the best of both worlds, compressed gradients and an SGD-level convergence rate, and its momentum counterpart is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
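The signSGD update itself transmits only the sign of each gradient coordinate (one bit instead of a 32-bit float); the momentum counterpart, often called Signum, applies the sign of an exponential moving average instead. A minimal single-worker sketch (beta and the variable names are illustrative assumptions), reusing the numpy import from above:

def signum_step(x, grad, momentum, lr, beta=0.9):
    # Sign-momentum (Signum-style) update: only the sign of a moving average of gradients is applied.
    momentum = beta * momentum + (1.0 - beta) * grad
    return x - lr * np.sign(momentum), momentum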
NEXT: In-Network Nonconvex Optimization
  • P. Di Lorenzo, G. Scutari
  • Computer Science, Mathematics
  • IEEE Transactions on Signal and Information Processing over Networks
  • 2016
TLDR: This work introduces the first algorithmic framework for the distributed minimization of the sum of a smooth function (the agents' sum-utility) plus a convex (possibly nonsmooth and nonseparable) regularizer, and shows that the new method compares favorably to existing distributed algorithms on both convex and nonconvex problems.