Corpus ID: 231933673

MARINA: Faster Non-Convex Distributed Learning with Compression

@article{Gorbunov2021MARINAFN,
  title={MARINA: Faster Non-Convex Distributed Learning with Compression},
  author={Eduard A. Gorbunov and Konstantin Burlachenko and Zhize Li and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.07845}
}
We develop and analyze MARINA: a new communication-efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator… 
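To make the gradient-difference idea above concrete, here is a minimal NumPy sketch of one MARINA-style round, written from the description in the abstract rather than from the authors' code; the compressor choice (Rand-K), the function names, and the parameters gamma, p, and k are illustrative assumptions.

import numpy as np

def rand_k(v, k, rng):
    # Unbiased Rand-K compressor (illustrative choice): keep k random
    # coordinates and rescale by d/k so the expectation is preserved.
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros(d)
    out[idx] = (d / k) * v[idx]
    return out

def marina_step(x, g, grads_prev, local_grad_fns, gamma, p, k, rng):
    # One MARINA-style iteration (sketch, not the authors' implementation).
    # x          : current model, identical on all nodes
    # g          : current (biased) gradient estimator, identical on all nodes
    # grads_prev : each node's local gradient at the previous point
    x_new = x - gamma * g
    grads_new = [grad_fn(x_new) for grad_fn in local_grad_fns]
    if rng.random() < p:
        # rare round: every node sends its full local gradient
        g_new = np.mean(grads_new, axis=0)
    else:
        # usual cheap round: only compressed gradient *differences* travel
        deltas = [rand_k(gn - gp, k, rng)
                  for gn, gp in zip(grads_new, grads_prev)]
        g_new = g + np.mean(deltas, axis=0)
    return x_new, g_new, grads_new

Even though each rand_k call is unbiased, g_new is in general a biased estimate of the full gradient at x_new because it depends on the previous estimator g, which is the property the abstract highlights.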

Citations

Faster Rates for Compressed Federated Learning with Client-Variance Reduction
TLDR
Neither COFIG nor FRECON needs to communicate with all clients, and both provide the first or faster convergence results for convex and nonconvex federated learning, whereas previous works either require communication with all clients or obtain worse convergence results.
Accelerating Federated Learning via Sampling Anchor Clients with Large Batches
TLDR
A unified framework, FedAMD, is proposed, which partitions the participants into anchor and miner groups based on time-varying probabilities and achieves a convergence rate of O(1/ε) for non-convex objectives by sampling an anchor with a constant probability.
Federated Learning with a Sampling Algorithm under Isoperimetry
TLDR
This work proposes a communication-efficient variant of the Langevin algorithm to sample from the posterior, and analyzes the algorithm without assuming that the target distribution is strongly log-concave, which allows for nonconvexity.
DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization
TLDR
The theory of DASHA, a new family of methods for nonconvex distributed optimization problems, improves the oracle and communication complexity of the previous state-of-the-art method MARINA and is corroborated in practice in experiments with nonconvex classification and the training of deep learning models.
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback
TLDR
It is proved that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a strong bounded-gradients assumption.
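For orientation, the error-feedback update behind EF21 is simple enough to sketch; the version below, with a Top-K compressor, is my reading of the mechanism and uses illustrative names, not the authors' code.

import numpy as np

def top_k(v, k):
    # Biased Top-K compressor: keep only the k largest-magnitude coordinates.
    out = np.zeros_like(v, dtype=float)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_round(x, g_list, local_grad_fns, gamma, k):
    # One EF21-style round (sketch): each worker compresses the *change* in its
    # running gradient estimate, so compression errors are corrected over time.
    g_bar = np.mean(g_list, axis=0)           # server aggregate of worker estimates
    x_new = x - gamma * g_bar
    new_g_list = []
    for g_i, grad_fn in zip(g_list, local_grad_fns):
        c_i = top_k(grad_fn(x_new) - g_i, k)  # only this compressed vector is sent
        new_g_list.append(g_i + c_i)
    return x_new, new_g_list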
QuAFL: Federated Averaging Can Be Both Asynchronous and Communication-Efficient
TLDR
This work jointly addresses two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients.
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
TLDR
Theoretical convergence guarantees are derived for Byz-VR-MARINA that outperform the previous state-of-the-art for general non-convex and Polyak-Łojasiewicz loss functions, together with the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients.
BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression
TLDR
This paper proposes BEER, which adopts communication compression with gradient tracking, and shows that it converges at a faster O(1/T) rate than the state-of-the-art, matching the rate without compression even under arbitrary data heterogeneity.
FL_PyTorch: optimization research simulator for federated learning
TLDR
FL_PyTorch is a suite of open-source software written in Python that builds on top of PyTorch, one of the most popular research Deep Learning (DL) frameworks, to enable fast development, prototyping, and experimentation with new and existing FL optimization algorithms.
Compression and Data Similarity: Combination of Two Techniques for Communication-Efficient Solving of Distributed Variational Inequalities
TLDR
This paper considers a combination of two popular approaches, compression and data similarity, and shows that this synergy can be more effective than either approach alone for solving distributed smooth, strongly monotone variational inequalities.
...

References

SHOWING 1-10 OF 49 REFERENCES
Federated Learning with Compression: Unified Analysis and Sharp Guarantees
TLDR
This work proposes a set of algorithms with periodic compressed (quantized or sparsified) communication, analyzes their convergence in both homogeneous and heterogeneous local data distribution settings, and introduces a scheme to mitigate data heterogeneity.
Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations
TLDR
This paper proposes the Qsparse-local-SGD algorithm, which combines aggressive sparsification and quantization with local computation and error compensation, keeping track of the difference between the true and compressed gradients, and demonstrates that it converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.
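The error-compensation idea mentioned here, tracking the gap between the true and compressed gradients, can be sketched in a few lines; compress is a generic compressor callable, the vectors are assumed to be NumPy arrays, and the names are mine, not the paper's.

def error_compensated_message(grad, error, compress):
    # Classic error-compensation step (sketch): add back the residual left over
    # from earlier rounds, compress the corrected vector, and carry the new
    # residual forward so nothing is permanently dropped.
    corrected = grad + error
    msg = compress(corrected)        # e.g. a sparsifier and/or quantizer
    new_error = corrected - msg
    return msg, new_error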
LIBSVM: A library for support vector machines
TLDR
Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
TLDR
The results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the theoretical results and confirming the practical superiority of PAGE.
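As a quick reference for the estimator named here, a minimal sketch of a PAGE-style gradient estimate follows; the names, the switch probability p, and the batch sizes are illustrative assumptions based on the paper's description, not its code.

import numpy as np

def page_gradient(g_prev, x_new, x_prev, sample_grad, n, b_large, b_small, p, rng):
    # PAGE-style estimator (sketch): with probability p recompute a large-batch
    # gradient; otherwise reuse the previous estimate and correct it with a
    # small batch of gradient differences.
    # sample_grad(i, x) is assumed to return the gradient of the i-th sample at x.
    if rng.random() < p:
        idx = rng.choice(n, size=b_large, replace=False)
        return np.mean([sample_grad(i, x_new) for i in idx], axis=0)
    idx = rng.choice(n, size=b_small, replace=False)
    diffs = [sample_grad(i, x_new) - sample_grad(i, x_prev) for i in idx]
    return g_prev + np.mean(diffs, axis=0)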
A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization
TLDR
This paper provides a single convergence analysis for all methods that satisfy the proposed unified assumption on the second moment of the stochastic gradient, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant.
On Biased Compression for Distributed Learning
TLDR
It is shown for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings, and a new high-performing biased compressor (a combination of Top-k and natural dithering) is proposed, which in the authors' experiments outperforms all other compression techniques.
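To give a flavor of the compressor family named here, below is a toy composition of a Top-k selector with a stochastic rounding of magnitudes to powers of two, loosely in the spirit of natural compression/dithering; this is only an illustration of the idea, not the construction from the paper.

import numpy as np

def natural_round(v, rng):
    # Stochastically round each nonzero magnitude to one of the two nearest
    # powers of two (unbiased choice), keeping the sign. Illustrative only.
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    nz = v != 0
    mag = np.abs(v[nz])
    lo = 2.0 ** np.floor(np.log2(mag))       # nearest power of two from below
    prob_up = (mag - lo) / lo                # chosen so the expectation equals mag
    up = rng.random(mag.size) < prob_up
    out[nz] = np.sign(v[nz]) * np.where(up, 2.0 * lo, lo)
    return out

def top_k_natural(v, k, rng):
    # Toy biased compressor: keep the k largest-magnitude entries, then round
    # the surviving values to powers of two so they are cheap to encode.
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = natural_round(v[idx], rng)
    return out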
Distributed Learning with Compressed Gradient Differences
TLDR
This work proposes a new distributed learning method, DIANA, which tackles the communication bottleneck via compression of gradient differences, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.
Federated Learning: Strategies for Improving Communication Efficiency
TLDR
Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
TLDR
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and have been developed separately in various communities.
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
...