Corpus ID: 237420500

On Second-order Optimization Methods for Federated Learning

Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, Sebastian U. Stich
We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local first-order optimization steps between communication rounds. In this work, we evaluate the performance of several second-order distributed methods with local steps in the FL setting which promise to have favorable convergence properties. We (i) show that FedAvg performs… 
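The FedAvg scheme described above, local first-order steps on each client followed by server-side averaging, can be sketched as follows (a minimal illustration on a least-squares objective; the function name, objective, and hyperparameters are hypothetical, not taken from the paper):

```python
import numpy as np

def fedavg_round(global_w, client_data, local_steps=5, lr=0.1):
    """One FedAvg communication round (illustrative sketch): each client
    runs several local SGD steps on its own least-squares loss, then the
    server averages the resulting models."""
    updated = []
    for X, y in client_data:
        w = global_w.copy()
        for _ in range(local_steps):
            grad = X.T @ (X @ w - y) / len(y)  # local first-order gradient
            w -= lr * grad                     # local optimization step
        updated.append(w)
    return np.mean(updated, axis=0)            # server-side averaging
```

Multiple local steps per round are what save communication relative to synchronizing after every gradient step.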


FedSSO: A Federated Server-Side Second-Order Optimization Algorithm

This work employs a server-side approximation for the Quasi-Newton method without requiring any training data from the clients, shifting the computational burden from clients to the server and entirely eliminating the additional communication for second-order updates between clients and server.

Over-the-Air Federated Learning via Second-Order Optimization

This paper proposes a novel over-the-air second-order federated optimization algorithm that simultaneously reduces the number of communication rounds and enables low-latency global model aggregation, exploiting the waveform superposition property of a multi-access channel to implement the distributed second-order optimization algorithm over wireless networks.

SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

This work obtains tight convergence rates for FedAvg and proves that it suffers from 'client drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence; it then proposes a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for this client drift in its local updates.
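The control-variate correction at the heart of SCAFFOLD can be sketched as follows (a simplified single-client update loosely following the paper's "option II" control-variate rule; the gradient function, names, and hyperparameters here are placeholders):

```python
import numpy as np

def scaffold_client_update(w_global, grad_fn, c_local, c_global,
                           local_steps=10, lr=0.1):
    """One client's SCAFFOLD-style update (sketch): every local step is
    corrected by (c_global - c_local), which counteracts client drift on
    heterogeneous data. Returns the updated model and control variate."""
    w = w_global.copy()
    for _ in range(local_steps):
        w -= lr * (grad_fn(w) - c_local + c_global)  # drift-corrected step
    # Refresh the local control variate from the progress made this round.
    c_new = c_local - c_global + (w_global - w) / (local_steps * lr)
    return w, c_new
```

The server then averages the client models and control variates; without the correction term, each client would drift toward its own local optimum.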

FedDANE: A Federated Newton-Type Method

This work proposes FedDANE, an optimization method that is adapted from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning, and provides convergence guarantees for this method when learning over both convex and non-convex functions.

Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning.

This work proposes a general framework Mime which mitigates client-drift and adapts arbitrary centralized optimization algorithms to federated learning and strongly establishes Mime's superiority over other baselines.

FedNL: Making Newton-Type Methods Applicable to Federated Learning

This work proposes a family of Federated Newton Learn methods, which the authors believe is a marked step toward making second-order methods applicable to FL; it employs a Hessian learning technique which enhances privacy, provably learns the Hessian at the optimum, and provably works with general contractive compression operators.

Advances and Open Problems in Federated Learning

Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

This work presents a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing, and extends the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso.

LocalNewton: Reducing Communication Bottleneck for Distributed Learning

This work proposes LocalNewton, a distributed second-order algorithm with local averaging, and devises an adaptive scheme to choose L, the number of local iterations performed on worker machines between two model synchronizations, reducing it as training proceeds to successively refine the model quality at the master.
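The local-Newton-steps-with-averaging idea can be sketched as follows (a minimal illustration on local least-squares losses; the adaptive rule for shrinking L over training is omitted, and the function name and setup are hypothetical):

```python
import numpy as np

def localnewton_round(w, client_data, L=3):
    """One LocalNewton-style round (sketch): each worker takes L Newton
    steps on its local least-squares loss, then the master averages."""
    updated = []
    for X, y in client_data:
        wk = w.copy()
        for _ in range(L):
            r = X @ wk - y
            g = X.T @ r / len(y)         # local gradient
            H = X.T @ X / len(y)         # local Hessian
            wk -= np.linalg.solve(H, g)  # local Newton step
        updated.append(wk)
    return np.mean(updated, axis=0)      # averaging at the master
```

Because each worker uses curvature information locally, far fewer synchronization rounds are needed than with first-order local updates.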

A Distributed Second-Order Algorithm You Can Trust

A new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers and dynamically adapts the auxiliary model to compensate for modeling errors is presented.

Local SGD Converges Fast and Communicates Little

This work proves concise convergence rates for local SGD on convex problems and shows that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.

Communication-Efficient Learning of Deep Networks from Decentralized Data

This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.