# Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis

```
@article{DomingoEnrich2022ComputingTV,
  title   = {Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis},
  author  = {Carles Domingo-Enrich},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2206.00632}
}
```
When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation, which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain…
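The three methods differ only in how the component functions are ordered: with-replacement sampling (SGD), a fresh permutation every epoch (SGD-RR), or a single permutation reused in every epoch (SGD-SO). The following Python sketch is a toy illustration (not the paper's code or experiments): it runs the three sampling schemes on a synthetic least-squares problem and reports the mean squared distance of the end-of-epoch iterate from the minimizer as a crude proxy for the stationary variance that the paper analyzes via power spectral densities.

```python
# Toy comparison of SGD, SGD-RR and SGD-SO on f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2.
# Illustrative sketch only; problem sizes, step size and the variance proxy are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, epochs = 32, 5, 0.05, 2000

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the full objective

def grad(i, x):
    # Gradient of the i-th component 0.5*(a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def run(scheme):
    x = np.zeros(d)
    fixed_perm = rng.permutation(n)             # used only by shuffle-once
    sq_dist = []
    for epoch in range(epochs):
        if scheme == "sgd":                     # with-replacement sampling
            order = rng.integers(0, n, size=n)
        elif scheme == "rr":                    # fresh permutation each epoch
            order = rng.permutation(n)
        else:                                   # "so": one permutation reused forever
            order = fixed_perm
        for i in order:
            x -= lr * grad(i, x)
        if epoch > epochs // 2:                 # crude "stationary" window
            sq_dist.append(np.sum((x - x_star) ** 2))
    return np.mean(sq_dist)

for scheme in ["sgd", "rr", "so"]:
    print(scheme, run(scheme))
```

Under these assumptions the printed values typically follow the ordering the abstract describes (SGD largest, then SGD-RR, then SGD-SO), though this toy proxy is far cruder than the paper's spectral analysis.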
