Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis
@article{DomingoEnrich2022ComputingTV, title={Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis}, author={Carles Domingo-Enrich}, journal={ArXiv}, year={2022}, volume={abs/2206.00632} }
When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain…
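The three sampling schemes named in the abstract differ only in the order in which component functions are visited. The sketch below is an illustration, not the paper's code: the helper names `sample_indices` and `run`, the toy objective f(x) = (1/n) Σ_i ½(x − a_i)², and the default step size are all assumptions made for this example.

```python
import random


def sample_indices(n, epochs, scheme, seed=0):
    """Return the order in which the n component functions are visited.

    scheme: "sgd" (sampling with replacement),
            "rr"  (a fresh permutation every epoch),
            "so"  (one permutation, reused every epoch).
    """
    rng = random.Random(seed)
    if scheme == "sgd":
        return [rng.randrange(n) for _ in range(n * epochs)]
    if scheme == "rr":
        order = []
        for _ in range(epochs):
            perm = list(range(n))
            rng.shuffle(perm)
            order.extend(perm)
        return order
    if scheme == "so":
        perm = list(range(n))
        rng.shuffle(perm)
        return perm * epochs
    raise ValueError(f"unknown scheme: {scheme}")


def run(targets, epochs, scheme, step=0.1, seed=0):
    """Iterate x <- x - step * (x - a_i) on the toy quadratic finite sum."""
    x = 0.0
    iterates = []
    for i in sample_indices(len(targets), epochs, scheme, seed):
        x -= step * (x - targets[i])  # gradient of 0.5 * (x - a_i)**2
        iterates.append(x)
    return iterates
```

Calling `run(targets, epochs, scheme)` for each of `"sgd"`, `"rr"` and `"so"` and computing the empirical variance of the late iterates gives a rough way to compare the schemes on this toy problem.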
References
Random Shuffling Beats SGD after Finite Epochs
- Computer Science; ICML
- 2019
It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations.
Why random reshuffling beats stochastic gradient descent
- Computer Science, Mathematics; Math. Program.
- 2021
This paper provides various convergence rate results for RR and variants when the sum function is strongly convex, and shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), RR with iterate averaging and a diminishing stepsize $\alpha_k = \Theta(1/k^s)$ converges to zero.
Random Reshuffling: Simple Analysis with Vast Improvements
- Computer Science; NeurIPS
- 2020
The theory for strongly convex objectives tightly matches the known lower bounds for both RR and SO, substantiates the common practical heuristic of shuffling once or only a few times, and proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once.
How Good is SGD with Random Shuffling?
- Computer Science; COLT 2019
- 2019
This paper proves that after $k$ passes over individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/nk^3\right)$, which partially corresponds to recently derived upper bounds.
Unified Optimal Analysis of the (Stochastic) Gradient Method
- Computer Science, Mathematics; ArXiv
- 2019
This note gives a simple proof for the convergence of stochastic gradient methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption and recovers the exponential convergence rate.
The Complexity of Finding Stationary Points with Stochastic Gradient Descent
- Computer Science; ICML
- 2020
It is shown that for nonconvex functions, the feasibility of minimizing gradients with SGD is surprisingly sensitive to the choice of optimality criteria, and this holds even if the analysis is limited to convex quadratic functions.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
- Computer Science; ICML
- 2012
This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate. For non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm
- Computer Science, Mathematics; NIPS
- 2014
An improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives is obtained, and it is shown how reweighting the sampling distribution is necessary in order to further improve convergence.
Understanding the Role of Momentum in Stochastic Gradient Methods
- Computer Science; NeurIPS
- 2019
The general formulation of quasi-hyperbolic momentum (QHM) is used to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions, and to derive sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.
SGD without Replacement: Sharper Rates for General Smooth Convex Functions
- Computer Science; ICML
- 2019
The first non-asymptotic results for SGD without replacement when applied to general smooth, strongly convex functions are provided, which show that SGD without replacement converges at a rate of $O(1/K^2)$ while SGD is known to converge at an $O(1/K)$ rate.