MARINA: Faster Non-Convex Distributed Learning with Compression
@article{Gorbunov2021MARINAFN,
  title   = {MARINA: Faster Non-Convex Distributed Learning with Compression},
  author  = {Eduard A. Gorbunov and Konstantin Burlachenko and Zhize Li and Peter Richt{\'a}rik},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2102.07845}
}
We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator…
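As a rough illustration of the compression strategy the abstract describes, here is a minimal NumPy sketch of a MARINA-style round: with a small probability the workers synchronize full gradients, and otherwise they transmit only compressed gradient differences that are added to the previous estimator. The rand-k compressor and all function names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

def marina_round(x, g, local_grad, p, gamma, k, rng):
    """One MARINA-style round for n workers (illustrative sketch).

    x          : current iterate, shape (d,)
    g          : current aggregated gradient estimator, shape (d,)
    local_grad : list of callables, local_grad[i](x) -> worker i's gradient
    p          : probability of a full (uncompressed) synchronization round
    gamma      : step size; k : sparsity level of the rand-k compressor
    """
    x_new = x - gamma * g                      # all workers take the same step

    if rng.random() < p:                       # rare round: full gradients
        g_new = np.mean([f(x_new) for f in local_grad], axis=0)
    else:                                      # usual cheap round: compressed differences
        diffs = [rand_k(f(x_new) - f(x), k, rng) for f in local_grad]
        g_new = g + np.mean(diffs, axis=0)     # biased estimator built on top of g

    return x_new, g_new
```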
32 Citations
Faster Rates for Compressed Federated Learning with Client-Variance Reduction
- Computer Science, ArXiv
- 2021
Both COFIG and FRECON avoid communicating with all the clients and provide the first or faster convergence results for convex and nonconvex federated learning, while previous works either require communication with all clients or obtain worse convergence results.
Accelerating Federated Learning via Sampling Anchor Clients with Large Batches
- Computer Science, ArXiv
- 2022
A unified framework, FedAMD, is proposed, which partitions the participants into anchor and miner groups based on time-varying probabilities and achieves a convergence rate of O(1/ε) for non-convex objectives by sampling an anchor with a constant probability.
Federated Learning with a Sampling Algorithm under Isoperimetry
- Computer Science, ArXiv
- 2022
This work proposes a communication-efficient variant of the Langevin algorithm to sample a posteriori, and analyzes the algorithm without assuming that the target distribution is strongly log-concave, which allows for nonconvexity.
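For context, the update this entry builds on is the unadjusted Langevin algorithm. The sketch below shows only the plain, uncompressed distributed step, under the assumption that the potential splits additively across clients; it is not the cited paper's communication-efficient method.

```python
import numpy as np

def distributed_ula_step(x, client_grads, gamma, rng):
    """One step of a plain distributed unadjusted Langevin algorithm (ULA).

    Samples approximately from a density proportional to exp(-U(x)), where
    U(x) = sum_i U_i(x) is split across clients. This is only the baseline
    the cited work makes communication-efficient.
    """
    grad = np.sum([g(x) for g in client_grads], axis=0)   # aggregate client gradients of U
    noise = rng.normal(size=x.shape)                       # injected Gaussian noise
    return x - gamma * grad + np.sqrt(2.0 * gamma) * noise
```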
DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization
- Computer Science, ArXiv
- 2022
DASHA is a new family of methods for nonconvex distributed optimization problems that improves the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA; the theory is corroborated in practice in experiments with nonconvex classification and training of deep learning models.
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback
- Computer Science, NeurIPS
- 2021
It is proved that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a strong bounded-gradients assumption.
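As a sketch of the error-feedback mechanism referenced here (an illustrative approximation, not the authors' code): each worker keeps a local gradient estimate and sends only a compressed correction toward its fresh gradient. The top-k compressor is chosen for concreteness.

```python
import numpy as np

def top_k(v, k):
    """Contractive top-k compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_round(x, g_local, local_grad, gamma, k):
    """One EF21-style round (illustrative sketch, not the authors' implementation)."""
    g = np.mean(g_local, axis=0)               # server-side aggregate of worker states
    x_new = x - gamma * g                      # gradient step with the aggregate
    # each worker transmits only a compressed correction toward its fresh gradient
    g_local_new = [gi + top_k(f(x_new) - gi, k) for gi, f in zip(g_local, local_grad)]
    return x_new, g_local_new
```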
QuAFL: Federated Averaging Can Be Both Asynchronous and Communication-Efficient
- Computer Science, ArXiv
- 2022
This work jointly addresses two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients.
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
- Computer Science, ArXiv
- 2022
Theoretical convergence guarantees are derived for Byz-VR-MARINA, outperforming the previous state-of-the-art for general non-convex and Polyak-Łojasiewicz loss functions, together with the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients.
BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression
- Computer Science, ArXiv
- 2022
This paper proposes BEER, which adopts communication compression with gradient tracking, and shows that it converges at a faster rate of O(1/T) than the state-of-the-art rate, matching the rate without compression even under arbitrary data heterogeneity.
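For readers unfamiliar with gradient tracking, the sketch below shows the plain, uncompressed tracking recursion that BEER builds on; the compression of exchanged messages, which is the cited paper's actual contribution, is omitted, and all names are illustrative.

```python
import numpy as np

def gradient_tracking_round(X, V, grads_prev, local_grad, W, gamma):
    """One round of plain decentralized gradient tracking (no compression).

    X, V       : (n, d) arrays of local iterates and gradient trackers
    grads_prev : (n, d) array of the previous local gradients
    local_grad : list of callables, local_grad[i](x) -> gradient of f_i at x
    W          : (n, n) doubly stochastic mixing matrix of the network
    """
    X_new = W @ X - gamma * V                                  # mix with neighbours, then step
    grads_new = np.stack([f(x) for f, x in zip(local_grad, X_new)])
    V_new = W @ V + grads_new - grads_prev                     # track the average gradient
    return X_new, V_new, grads_new
```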
FL_PyTorch: optimization research simulator for federated learning
- Computer Science, DistributedML@CoNEXT
- 2021
FL_PyTorch is a suite of open-source software written in Python that builds on top of one of the most popular research deep learning (DL) frameworks, PyTorch, to enable fast development, prototyping, and experimentation with new and existing FL optimization algorithms.
Compression and Data Similarity: Combination of Two Techniques for Communication-Efficient Solving of Distributed Variational Inequalities
- Computer Science, ArXiv
- 2022
This paper considers a combination of two popular approaches, compression and data similarity, and shows that this synergy can be more effective than either approach separately in solving distributed smooth strongly monotone variational inequalities.
References
SHOWING 1-10 OF 49 REFERENCES
Federated Learning with Compression: Unified Analysis and Sharp Guarantees
- Computer Science, AISTATS
- 2021
This work proposes a set of algorithms with periodic compressed (quantized or sparsified) communication, analyzes their convergence in both homogeneous and heterogeneous local data distribution settings, and introduces a scheme to mitigate data heterogeneity.
Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations
- Computer Science, IEEE Journal on Selected Areas in Information Theory
- 2020
This paper proposes the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation by keeping track of the difference between the true and compressed gradients, and demonstrates that it converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.
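Below is a generic sketch of the error-compensation idea mentioned in this entry, assuming a simple scaled-sign quantizer; it is not the Qsparse-local-SGD pseudocode, which additionally sparsifies and performs local steps.

```python
import numpy as np

def scaled_sign(v):
    """1-bit style quantizer: transmit signs plus a single scale (mean magnitude)."""
    return np.mean(np.abs(v)) * np.sign(v)

def compensated_message(grad, error):
    """Generic error-compensated compression step (illustrative sketch)."""
    corrected = grad + error          # add back what earlier rounds dropped
    msg = scaled_sign(corrected)      # send only the cheap compressed message
    new_error = corrected - msg       # store the freshly dropped part locally
    return msg, new_error
```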
LIBSVM: A library for support vector machines
- Computer Science, TIST
- 2011
Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
- Computer Science, ICML
- 2021
The results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the theoretical results and confirming the practical superiority of PAGE.
A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization
- Computer Science, ArXiv
- 2020
This paper provides a single convergence analysis for all methods that satisfy the proposed unified assumption on the second moment of the stochastic gradient, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant.
On Biased Compression for Distributed Learning
- Computer Science, ArXiv
- 2020
It is shown for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings, and a new, highly performing biased compressor is proposed, a combination of Top-k and natural dithering, which in the authors' experiments outperforms all other compression techniques.
Distributed Learning with Compressed Gradient Differences
- Computer Science, ArXiv
- 2019
This work proposes a new distributed learning method, DIANA, which works via compression of gradient differences, and performs a theoretical analysis in the strongly convex and nonconvex settings, showing that its rates are superior to existing rates.
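For comparison with the MARINA sketch near the abstract, here is an illustrative DIANA-style round, with the compressor left as a user-supplied unbiased operator; names and structure are assumptions, not the authors' code.

```python
import numpy as np

def diana_round(x, h_local, local_grad, compress, alpha, gamma):
    """One DIANA-style round (illustrative sketch).

    Each worker keeps a shift h_i and transmits only the compressed difference
    between its fresh gradient and that shift; the shift is then moved a step
    of size alpha toward the gradient. `compress` is any unbiased compressor.
    """
    msgs = [compress(f(x) - h) for f, h in zip(local_grad, h_local)]   # what workers send
    g = np.mean([h + m for h, m in zip(h_local, msgs)], axis=0)        # unbiased estimator
    h_local_new = [h + alpha * m for h, m in zip(h_local, msgs)]       # update local shifts
    return x - gamma * g, h_local_new
```

Any unbiased compressor, for example the rand-k operator from the sketch near the abstract, can be passed as `compress`.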
Federated Learning: Strategies for Improving Communication Efficiency
- Computer Science, ArXiv
- 2016
Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
- Computer Science, ICML
- 2020
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.
Decentralized Deep Learning with Arbitrary Communication Compression
- Computer Science, ICLR
- 2020
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.