Publications
Tighter Theory for Local SGD on Identical and Heterogeneous Data
We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory.
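A minimal sketch of the local SGD template analyzed above, under illustrative assumptions (the number of workers, step sizes, and the toy quadratic losses are all made up): each worker runs several SGD steps on its own data, after which the local models are averaged.

```python
import numpy as np

def local_sgd(grad, x0, num_workers=4, local_steps=10, rounds=50, lr=0.01, rng=None):
    """Generic local SGD loop: workers take `local_steps` SGD steps on their
    own stochastic gradients, then synchronize by averaging their models."""
    rng = rng or np.random.default_rng(0)
    x = np.tile(x0, (num_workers, 1)).astype(float)  # one model copy per worker
    for _ in range(rounds):
        for m in range(num_workers):
            for _ in range(local_steps):
                # grad(m, x, rng) returns a stochastic gradient of worker m's local loss
                x[m] -= lr * grad(m, x[m], rng)
        x[:] = x.mean(axis=0)  # communication round: average the local models
    return x[0]

# Toy heterogeneous example: f_m(x) = 0.5 * ||x - b_m||^2 with gradient noise
if __name__ == "__main__":
    d = 5
    b = np.random.default_rng(1).normal(size=(4, d))  # a different optimum per worker
    noisy_grad = lambda m, x, rng: (x - b[m]) + 0.01 * rng.normal(size=d)
    print(local_sgd(noisy_grad, np.zeros(d)))  # approaches the mean of the b_m
```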
Distributed Learning with Compressed Gradient Differences
TLDR
This work proposes a new distributed learning method, DIANA, which addresses the communication bottleneck by compressing gradient differences, provides a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.
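The core idea, compressing gradient differences rather than the gradients themselves, can be illustrated with a small single-process sketch; the random-k sparsifier, parameter names, and step sizes below are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

def rand_k(v, k, rng):
    """Random-k sparsification: keep k random coordinates, rescale to stay unbiased."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def diana_step(x, h, grads, lr=0.1, alpha=0.05, k=2, rng=None):
    """One DIANA-style round, simulated in a single process.

    grads: list of local gradients evaluated at x, one per worker.
    h:     list of per-worker reference vectors, kept in sync by worker and server.
    Only the compressed differences would travel over the network.
    """
    rng = rng or np.random.default_rng(0)
    g_hat = np.zeros_like(x)
    for i, g_i in enumerate(grads):
        delta = rand_k(g_i - h[i], k, rng)  # compress the difference, not the gradient
        g_hat += h[i] + delta               # server's unbiased estimate of g_i
        h[i] = h[i] + alpha * delta         # shift the reference toward the gradient
    g_hat /= len(grads)
    return x - lr * g_hat, h
```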
Random Reshuffling: Simple Analysis with Vast Improvements
TLDR
The theory for strongly convex objectives tightly matches the known lower bounds for both Random Reshuffling (RR) and Shuffle-Once (SO), substantiates the common practical heuristic of shuffling only once or a few times, and proves fast convergence of the Shuffle-Once algorithm, which permutes the data a single time.
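A short sketch contrasting the two permutation schemes in the TLDR (loss functions, step size, and epoch count are illustrative assumptions): Random Reshuffling draws a fresh permutation every epoch, while Shuffle-Once reuses one permutation throughout.

```python
import numpy as np

def permutation_sgd(grads, x0, epochs=20, lr=0.05, reshuffle=True, seed=0):
    """grads: list of per-sample gradient functions g_i(x).
    reshuffle=True  -> Random Reshuffling (new permutation each epoch)
    reshuffle=False -> Shuffle-Once (one permutation, reused every epoch)"""
    rng = np.random.default_rng(seed)
    n, x = len(grads), np.array(x0, dtype=float)
    perm = rng.permutation(n)  # the single permutation used by Shuffle-Once
    for _ in range(epochs):
        if reshuffle:
            perm = rng.permutation(n)
        for i in perm:
            x -= lr * grads[i](x)
    return x
```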
Stochastic Distributed Learning with Gradient Quantization and Variance Reduction
TLDR
These are the first methods that achieve linear convergence for arbitrary quantized updates in distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server.
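As an example of the kind of quantized update such methods send to the central server, here is QSGD-style random dithering, a standard unbiased quantizer; it is given purely for illustration and is not necessarily the exact operator analyzed in the paper.

```python
import numpy as np

def random_dithering(v, levels=4, rng=None):
    """Unbiased random dithering: quantize |v_i| / ||v|| onto a uniform grid
    with `levels` points, keeping the sign pattern and the norm exactly."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    # round up with probability equal to the fractional part -> unbiasedness
    quantized = lower + (rng.random(v.shape) < scaled - lower)
    return np.sign(v) * norm * quantized / levels

# In expectation random_dithering(v) equals v, so plugging it into a compressed
# distributed method keeps gradient estimates unbiased while shrinking messages.
```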
First Analysis of Local GD on Heterogeneous Data
TLDR
It is shown that, in the low-accuracy regime, the local gradient descent method has the same communication complexity as gradient descent.
Revisiting Stochastic Extragradient
TLDR
This work fixes a fundamental issue in the stochastic extragradient method by providing a new sampling strategy that is motivated by approximating implicit updates, and proves guarantees for solving variational inequalities that go beyond existing settings.
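A minimal sketch of a stochastic extragradient step in which the same sample is reused for the extrapolation and the update, the kind of sampling strategy the TLDR alludes to; the operator list, step size, and names are illustrative assumptions.

```python
import numpy as np

def stochastic_extragradient(operator_samples, z0, steps=1000, lr=0.05, seed=0):
    """operator_samples: list of sampled operators F_i(z), e.g. stochastic
    gradients of a min-max problem. Each iteration draws ONE sample and reuses
    it for both the extrapolation and the update step."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        F = operator_samples[rng.integers(len(operator_samples))]
        z_half = z - lr * F(z)       # extrapolation using the drawn sample
        z = z - lr * F(z_half)       # update using the SAME sample
    return z
```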
Proximal and Federated Random Reshuffling
TLDR
Two new algorithms are proposed, Proximal and Federated Random Reshuffling (ProxRR and FedRR), which solve composite convex finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of n smooth objectives.
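A simplified sketch of a ProxRR-style epoch, assuming an L1 regularizer for concreteness (function names and parameters are assumptions): one Random Reshuffling pass over the smooth components, followed by a single proximal step for the regularizer taken with step size n times the inner step.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||x||_1 (soft-thresholding), used as an example regularizer."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_rr_epoch(grads, x, lr, prox=prox_l1, rng=None):
    """One ProxRR-style epoch (simplified): a reshuffled pass over the n smooth
    components, then one proximal step for the non-smooth regularizer."""
    rng = rng or np.random.default_rng(0)
    n = len(grads)
    for i in rng.permutation(n):
        x = x - lr * grads[i](x)
    return prox(x, n * lr)  # the prox is applied once per epoch, not once per step
```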
SEGA: Variance Reduction via Gradient Sketching
We propose a randomized first-order optimization method, SEGA (SkEtched GrAdient), which progressively builds, throughout its iterations, a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient.
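A simplified sketch of SEGA specialized to coordinate sketches (the general method allows arbitrary linear sketches; names and parameters here are assumptions): only one partial derivative is observed per iteration, yet the estimate g remains unbiased while the auxiliary vector h drives its variance down over time.

```python
import numpy as np

def sega_coordinate(partial_grad, x0, d, steps=5000, lr=0.01, seed=0):
    """Simplified SEGA with coordinate sketches: partial_grad(x, i) returns the
    i-th partial derivative of the objective at x."""
    rng = np.random.default_rng(seed)
    x, h = np.array(x0, dtype=float), np.zeros(d)
    for _ in range(steps):
        i = rng.integers(d)
        diff = partial_grad(x, i) - h[i]
        g = h.copy()
        g[i] += d * diff          # unbiased estimate: E[g] equals the full gradient
        h[i] += diff              # h drifts toward the true gradient (variance reduction)
        x -= lr * g
    return x
```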
A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning
TLDR
This work proposes and analyzes a flexible asynchronous optimization algorithm for solving nonsmooth learning problems and proves that the algorithm converges linearly with a fixed learning rate that depends neither on communication delays nor on the number of machines.
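The following is a generic illustration of a proximal-gradient loop with bounded gradient delays, meant only to convey what delay tolerance refers to; it is not the paper's algorithm, and the delay model, step size, and names are assumptions.

```python
import numpy as np

def delayed_prox_gradient(grads, prox, x0, steps=2000, lr=0.05, max_delay=5, seed=0):
    """Each machine's gradient is evaluated at an iterate that is up to
    `max_delay` steps old, mimicking asynchronous communication, and the
    aggregated step is followed by a proximal step for the nonsmooth term."""
    rng = np.random.default_rng(seed)
    history = [np.array(x0, dtype=float)]
    for k in range(steps):
        x = history[-1]
        g = np.zeros_like(x)
        for grad in grads:  # one gradient per machine, each with its own staleness
            delay = rng.integers(0, min(max_delay, k) + 1)
            g += grad(history[-1 - delay])
        g /= len(grads)
        history.append(prox(x - lr * g, lr))
    return history[-1]
```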
MISO is Making a Comeback With Better Proofs and Rates
TLDR
Numerical experiments show that MISO is a serious competitor to SAGA and SVRG, sometimes outperforming them on real datasets; the analysis also derives minibatching bounds with arbitrary uniform sampling that lead to linear speedup when the expected minibatch size is in a certain range.
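A simplified sketch of the basic MISO update for mu-strongly convex components (a plain, non-minibatch variant; names and parameters are assumptions): each component keeps an anchor point, and the iterate minimizes the average of the quadratic lower bounds built at those anchors.

```python
import numpy as np

def miso(grads, x0, mu, epochs=50, seed=0):
    """Simplified MISO sketch for n components, each assumed mu-strongly convex."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    z = [np.array(x0, dtype=float) for _ in range(n)]   # anchor points
    stored = [grads[i](z[i]) for i in range(n)]         # gradients at the anchors
    for _ in range(epochs * n):
        # minimizer of the average quadratic surrogate: mean(z_i - grad_i(z_i) / mu)
        x = np.mean([z[i] - stored[i] / mu for i in range(n)], axis=0)
        j = rng.integers(n)
        z[j], stored[j] = x, grads[j](x)                # refresh one surrogate
    return x
```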
...