Corpus ID: 246430634

BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression

Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, Yuejie Chi
Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle this bottleneck, many communication-compressed algorithms have been designed for decentralized nonconvex optimization, where clients are only allowed to communicate a small amount of quantized information (i.e., bits) with their neighbors over a predefined graph topology. Despite significant…
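As an illustration of the contractive compressors such algorithms typically rely on, here is a minimal top-k sketch in NumPy (top-k is one standard choice of contractive compressor; the function name and example values are illustrative, not taken from the paper):

```python
import numpy as np

def top_k(x, k):
    """Contractive top-k compressor: keep the k largest-magnitude
    entries of x and zero out the rest. For x of dimension d it
    satisfies ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
c = top_k(x, 2)  # only the entries -2.0 and 3.0 survive
```

Transmitting only the k surviving index/value pairs is what reduces the per-round communication cost.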

A Multi-Token Coordinate Descent Method for Vertical Federated Learning

This work formalizes the multi-token semi-decentralized scheme, which subsumes the client-server and decentralized setups, and designs a feature-distributed learning algorithm for this setup, which can be seen as a parallel Markov chain (block) coordinate descent algorithm.

Coresets for Vertical Federated Learning: Regularized Linear Regression and K-Means Clustering

This paper proposes a coreset framework that constructs coresets in a distributed fashion for communication-efficient VFL, and theoretically shows that using coresets can drastically reduce the communication complexity while nearly maintaining the solution quality.

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

This work proposes and analyzes several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer. It also proposes an optimal algorithm, called SSRGD, based on SARAH, which can find an ε-approximate (first-order) stationary point by simply adding random perturbations.

SoteriaFL: A Unified Framework for Private Federated Learning with Communication Compression

A framework called SoteriaFL is proposed, which accommodates a general family of local gradient estimators, including popular stochastic variance-reduced gradient methods and the state-of-the-art shifted compression scheme, and is shown to achieve better communication complexity than other private federated learning algorithms without communication compression, while sacrificing neither privacy nor utility.

Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression

A convergence lower bound is established for algorithms using either unbiased or contractive compressors, with unidirectional or bidirectional communication, and an algorithm, NEOLITHIC, is proposed that almost reaches the lower bound (up to logarithmic factors) under mild conditions.

DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization

A new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS), for nonconvex optimization, which matches the optimal incremental first-order oracle complexity of centralized algorithms for finding stationary points while maintaining communication efficiency.

SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization

Theoretical understanding is corroborated with experiments and the performance of the algorithm is compared with the state-of-the-art, showing that without sacrificing much on the accuracy, SQuARM-SGD converges at a similar rate while saving significantly in total communicated bits.

Decentralized Deep Learning with Arbitrary Communication Compression

The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.

DeepSqueeze: Decentralization Meets Error-Compensated Compression

This paper proposes an algorithmic design, named DeepSqueeze, that employs error-compensated stochastic gradient descent in the decentralized scenario; this is the first application of error-compensated compression to decentralized learning.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD (QSGD) is proposed, a family of compression schemes for gradient updates that provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $n$ denotes the number of nodes, $T$ the number of iterations, $\delta$ the eigengap of the connectivity matrix, and $\omega$ the compression quality.
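The gossip-with-compression mechanism can be sketched like this: each node keeps a public copy of its value and broadcasts only a compressed difference. This is an illustrative variant for a toy averaging problem; the mixing matrix, step size, and the linear "compressor" below are assumptions for the demo, not CHOCO-SGD's exact recipe:

```python
import numpy as np

def compressed_gossip(x0, W, compress, gamma, steps):
    """Gossip averaging where nodes exchange only compressed
    differences x - x_hat; x_hat is each node's publicly known copy."""
    x = x0.astype(float).copy()
    x_hat = np.zeros_like(x)
    for _ in range(steps):
        x_hat = x_hat + compress(x - x_hat)   # broadcast compressed diff, update public copies
        x = x + gamma * (W @ x_hat - x_hat)   # consensus step using public copies
    return x

# Ring of 4 nodes holding scalars; W is symmetric and doubly stochastic.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
halve = lambda d: 0.5 * d  # toy contractive "compressor" (halves the transmitted update)
x = compressed_gossip(np.array([1.0, 2.0, 3.0, 4.0]), W, halve, gamma=0.2, steps=300)
# x approaches the network average 2.5 on every node
```

Because W is doubly stochastic, the consensus step preserves the network average at every iteration, so the nodes can only converge to the true mean.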

Sparsified SGD with Memory

This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
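The error-compensation mechanism can be sketched as follows (function names are illustrative; top-k stands in for any k-sparsification operator):

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def sparsified_sgd_step(w, grad, memory, lr, k):
    """One SGD step with top-k sparsification and error memory:
    the compression error is carried over and added to the next
    update, so no gradient information is permanently lost."""
    corrected = lr * grad + memory   # add leftover error from last step
    update = top_k(corrected, k)     # transmit only k coordinates
    memory = corrected - update      # remember what was dropped
    return w - update, memory
```

With k equal to the dimension this reduces exactly to vanilla SGD; for smaller k, the dropped mass accumulates in `memory` and is applied in later steps.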

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

This paper studies the D-PSGD algorithm and provides the first theoretical analysis indicating a regime in which decentralized algorithms can outperform centralized ones for distributed stochastic gradient descent.

EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback

It is proved that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a strong bounded-gradients assumption.
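A single-worker sketch of the EF21-style update: a gradient estimate g is maintained and only the compressed difference between the true gradient and g is communicated, so g tracks the gradient over time. Names are illustrative and top-k stands in for the contractive compressor:

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ef21_step(w, g_state, grad_fn, lr, k):
    """One EF21-style step (sketch): update the gradient estimate by
    a compressed difference, then descend along the estimate."""
    g_new = g_state + top_k(grad_fn(w) - g_state, k)  # compressed correction
    w_new = w - lr * g_new                            # descend along estimate
    return w_new, g_new
```

With k equal to the dimension, g equals the true gradient and the step reduces to plain gradient descent; with smaller k the estimate lags but still tracks the gradient.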

A Compressed Gradient Tracking Method for Decentralized Optimization With Linear Convergence

It is shown that C-GT inherits the advantages of gradient tracking-based algorithms and achieves linear convergence rate for strongly convex and smooth objective functions.

Fast Decentralized Nonconvex Finite-Sum Optimization with Recursive Variance Reduction

Over an infinite time horizon, it is established that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in both the almost-sure and mean-squared senses.