Analysis of Error Feedback in Federated Non-Convex Optimization with Biased Compression

Xiaoyun Li and Ping Li
In practical federated learning (FL) systems, e.g., wireless networks, the communication cost between the clients and the central server can often be a bottleneck. To reduce the communication cost, the paradigm of communication compression has become a popular strategy in the literature. In this paper, we focus on biased gradient compression techniques in non-convex FL problems. In the classical setting of distributed learning, the method of error feedback (EF) is a common technique to remedy… 
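As a rough illustration of the classical error-feedback idea mentioned above, the following minimal sketch shows one client compressing its gradient with a biased top-k compressor while carrying the discarded part forward to the next round. This is a generic textbook-style sketch, not the algorithm analyzed in the paper; the names `top_k` and `ef_step`, the choice of top-k as the compressor, and the toy gradients are all assumptions made for illustration.

```python
from typing import List, Tuple


def top_k(vec: List[float], k: int) -> List[float]:
    """Biased compressor (assumed here): keep the k largest-magnitude entries."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def ef_step(grad: List[float], error: List[float], k: int) -> Tuple[List[float], List[float]]:
    """One error-feedback step on a single (hypothetical) client.

    The residual left over from earlier rounds is added back to the fresh
    gradient before compression; whatever the compressor drops becomes the
    new residual.
    """
    corrected = [g + e for g, e in zip(grad, error)]
    message = top_k(corrected, k)  # what would be sent to the server
    new_error = [c - m for c, m in zip(corrected, message)]
    return message, new_error


# Toy run: the same gradient for three rounds, sending one coordinate per round.
error = [0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0]
for _ in range(3):
    msg, error = ef_step([1.0, 0.75, -0.25], error, k=1)
    total_sent = [t + m for t, m in zip(total_sent, msg)]
```

In this sketch, the messages sent so far plus the current residual always add up to the sum of the raw gradients, so the information dropped by the compressor is delayed rather than lost.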



Federated Learning with Compression: Unified Analysis and Sharp Guarantees

This work proposes a set of algorithms with periodic compressed (quantized or sparsified) communication, analyzes their convergence under both homogeneous and heterogeneous local data distributions, and introduces a scheme to mitigate data heterogeneity.

Fed-LAMB: Layerwise and Dimensionwise Locally Adaptive Optimization Algorithm

This paper presents Fed-LAMB, a novel federated learning method based on layer-wise and dimension-wise updates of the local models. The method addresses the nonconvexity and multi-layered nature of the optimization task and achieves faster convergence and better generalization than the state of the art.

EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

Six practical extensions of EF21 are proposed, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression.

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

A general distributed compressed SGD method with Nesterov's momentum is proposed, which achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.

SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

This work obtains tight convergence rates for FedAvg and proves that it suffers from "client drift" when the data is heterogeneous (non-iid), resulting in unstable and slow convergence. It proposes a new algorithm (SCAFFOLD) that uses control variates (variance reduction) to correct for the client drift in its local updates.

Linear Convergence in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients

This work is the first to provide tight linear convergence rate guarantees, and constitutes the first comprehensive analysis of gradient sparsification in FL.

On the Convergence of FedAvg on Non-IID Data

This paper analyzes the convergence of Federated Averaging on non-iid data and establishes a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD iterations.

Breaking the centralized barrier for cross-device federated learning

This work proposes a general algorithmic framework, MIME, which mitigates client drift and adapts an arbitrary centralized optimization algorithm, such as momentum or Adam, to the cross-device federated learning setting, and shows that MIME is provably faster than any centralized method.

FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization

FedPAQ, a communication-efficient federated learning method with Periodic Averaging and Quantization, is presented. It achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions, and experiments demonstrate the communication-computation tradeoff the method provides.

A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication

This paper studies the convergence rate of distributed SGD for non-convex optimization with two communication-reducing strategies, sparse parameter averaging and gradient quantization, and proposes a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $\mathcal{O}(1/\sqrt{MK})$ convergence rate.