Corpus ID: 211171485

Is Local SGD Better than Minibatch SGD?

@article{Woodworth2020IsLS,
  title={Is Local SGD Better than Minibatch SGD?},
  author={Blake E. Woodworth and Kumar Kshitij Patel and Sebastian U. Stich and Zhen Dai and Brian Bullins and H. B. McMahan and O. Shamir and Nathan Srebro},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.07839}
}
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking, and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex…
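Neither algorithm is spelled out on this page, but the comparison is easier to follow with the two update rules side by side. Below is a minimal NumPy sketch under illustrative assumptions (the quadratic objective, noise model, and function names such as local_sgd and minibatch_sgd are not from the paper): with M machines, K local steps per round, and R communication rounds, minibatch SGD averages all M*K stochastic gradients at the current shared iterate and takes one step per round, while local SGD lets each machine take K sequential steps before the iterates are averaged, so both use the same total number of gradients per round.

import numpy as np

# Illustrative objective f(x) = 0.5 * ||x||^2; the stochastic gradient oracle
# returns grad f(x) = x plus Gaussian noise (an assumption for this sketch).
def stochastic_grad(x, rng, noise=1.0):
    return x + noise * rng.standard_normal(x.shape)

def minibatch_sgd(x0, M, K, R, lr, rng):
    # Each round: average M*K stochastic gradients at the same iterate, take one step.
    x = x0.copy()
    for _ in range(R):
        g = np.mean([stochastic_grad(x, rng) for _ in range(M * K)], axis=0)
        x -= lr * g
    return x

def local_sgd(x0, M, K, R, lr, rng):
    # Each round: every machine runs K sequential SGD steps from the shared iterate,
    # then the M local iterates are averaged (same M*K gradients per round).
    x = x0.copy()
    for _ in range(R):
        local_iterates = []
        for _ in range(M):
            y = x.copy()
            for _ in range(K):
                y -= lr * stochastic_grad(y, rng)
            local_iterates.append(y)
        x = np.mean(local_iterates, axis=0)
    return x

rng = np.random.default_rng(0)
x0 = np.ones(10)
print(np.linalg.norm(minibatch_sgd(x0, M=8, K=16, R=20, lr=0.1, rng=rng)))
print(np.linalg.norm(local_sgd(x0, M=8, K=16, R=20, lr=0.1, rng=rng)))

The sketch only illustrates the structural difference (one large averaged step per round versus averaged local trajectories); the paper's guarantees and lower bounds concern how these two schemes compare in error as a function of M, K, and R.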
43 Citations
  • Minibatch vs Local SGD for Heterogeneous Distributed Learning
  • Linearly Converging Error Compensated SGD
  • Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency
  • The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Updates
  • Local SGD for Saddle-Point Problems
  • Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning
  • Taming GANs with Lookahead
  • Distributed Sparse SGD with Majority Voting
