# A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

@inproceedings{Gorbunov2020AUT, title={A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent}, author={Eduard A. Gorbunov and Filip Hanzely and Peter Richt{\'a}rik}, booktitle={AISTATS}, year={2020} }

In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub…
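As a rough illustration of the kind of method the abstract refers to, the sketch below runs proximal SGD with uniform mini-batch sampling on an L1-regularized least-squares problem. It is not the paper's algorithm or analysis: the objective, the function names (`soft_threshold`, `minibatch_gradient`, `proximal_sgd`) and all parameters are assumptions chosen for the example. The only structure taken from the abstract is the generic step "stochastic gradient estimate, then proximal operator".

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def minibatch_gradient(x, A, b, batch):
    """Mini-batch estimate of the gradient of f(x) = 1/(2n) * sum_i (a_i . x - b_i)^2.

    Averaging per-example gradients over a uniformly sampled batch gives an
    unbiased estimate of the full gradient.
    """
    g = [0.0] * len(x)
    for i in batch:
        residual = sum(aij * xj for aij, xj in zip(A[i], x)) - b[i]
        for j, aij in enumerate(A[i]):
            g[j] += residual * aij / len(batch)
    return g


def proximal_sgd(A, b, lam, step, batch_size, iters, seed=0):
    """Hypothetical helper: minimize f(x) + lam * ||x||_1 by proximal SGD."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(iters):
        # Uniform mini-batch sampled without replacement (an assumption of this sketch).
        batch = rng.sample(range(n), batch_size)
        g = minibatch_gradient(x, A, b, batch)
        # Gradient step followed by the proximal step for the L1 regularizer.
        x = soft_threshold([xj - step * gj for xj, gj in zip(x, g)], step * lam)
    return x
```

In this sketch, swapping `minibatch_gradient` for a different estimator (for example a variance-reduced or quantized one) would leave the outer loop unchanged, which is one way to picture how such variants can share a common template.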

## 52 Citations

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

- Computer Science, Mathematics
- ArXiv
- 2020

A unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer is presented, and a minibatch size is determined that improves not only the theoretical total complexity of the methods but also their convergence in practice.

Linearly Converging Error Compensated SGD

- Computer Science, Mathematics
- NeurIPS
- 2020

A unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates is proposed, along with EC-SGD-DIANA, the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both, and provides a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions.

A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

- Computer Science, Mathematics
- ArXiv
- 2020

This paper provides a single convergence analysis for all methods that satisfy the proposed unified assumption on the second moment of the stochastic gradient, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant.

High probability convergence and uniform stability bounds for nonconvex stochastic gradient descent

- Mathematics
- 2020

Stochastic gradient descent (with a mini-batch) is one of the most common iterative algorithms used in machine learning. While being computationally cheap to implement, recent literature suggests…

Stochastic Hamiltonian Gradient Methods for Smooth Games

- Computer Science, Mathematics
- ICML
- 2020

This work proposes a novel unbiased estimator for stochastic Hamiltonian gradient descent (SHGD), shows that SHGD converges linearly to the neighbourhood of a stationary point, and provides the first global non-asymptotic last-iterate convergence guarantees for certain classes of stochastic smooth games.

Random Reshuffling with Variance Reduction: New Analysis and Better Rates

- Computer Science, Mathematics
- ArXiv
- 2021

This work provides the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems and obtains the first sublinear rate for general convex problems.

MURANA: A Generic Framework for Stochastic Variance-Reduced Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

A generic variance-reduced algorithm is proposed for minimizing a sum of several smooth functions plus a regularizer, in a sequential or distributed manner. It is formulated with general stochastic operators, which allow it to cover many existing randomization mechanisms within a unified framework.

Error Compensated Proximal SGD and RDA

- 2020

Communication cost is a key bottleneck in distributed training of large machine learning models. In order to reduce the amount of communicated data, quantization and error compensation techniques…

Proximal Splitting Algorithms for Convex Optimization: A Tour of Recent Advances, with New Twists

- Mathematics
- 2019

Convex nonsmooth optimization problems, whose solutions live in very high dimensional spaces, have become ubiquitous. To solve them, the class of first-order algorithms known as proximal splitting…

## References

Showing 1-10 of 56 references

SGD: General Analysis and Improved Rates

- Computer Science, Mathematics
- ICML 2019
- 2019

This theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches, and can determine the mini-batch size that optimizes the total complexity.

Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

- Computer Science, Mathematics
- IEEE Journal of Selected Topics in Signal Processing
- 2016

It is proved that as long as the mini-batch size b is below a certain threshold, any predefined accuracy can be reached with less overall work than without mini-batching, and the method is suitable for further acceleration by parallelization.

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both, and provides a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions.

Coordinate descent with arbitrary sampling I: algorithms and complexity

- Mathematics, Computer Science
- Optim. Methods Softw.
- 2016

A complexity analysis of ALPHA is provided, from which complexity bounds for its many variants are deduced as direct corollaries, all matching or improving the best known bounds.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

- Computer Science
- NIPS
- 2017

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.

Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches

- Mathematics, Computer Science
- AISTATS
- 2019

This paper designs a new importance sampling for mini-batch ACD which significantly outperforms the previous state-of-the-art minibatch ACD in practice, and proves a rate that is at most three times worse than the rate of minibatch ACD with uniform sampling, but can be three times better.

SEGA: Variance Reduction via Gradient Sketching

- Computer Science, Mathematics
- NeurIPS
- 2018

We propose a randomized first order optimization method, SEGA (SkEtched GrAdient method), which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random…

Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling

- Mathematics, Computer Science
- NIPS
- 2015

This work proposes and analyzes a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution.

The Convergence of Sparsified Gradient Methods

- Computer Science, Mathematics
- NeurIPS
- 2018

It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

Estimate Sequences for Variance-Reduced Stochastic Composite Optimization

- Mathematics, Computer Science
- ICML
- 2019

A unified view of gradient-based algorithms for stochastic convex composite optimization is proposed by extending the concept of estimate sequence introduced by Nesterov. It provides a generic proof of convergence for approaches such as SAGA and SVRG, and has several advantages.