Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis

Author: Carles Domingo-Enrich
When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain…
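
The three sampling schemes named in the abstract differ only in the order in which component functions are visited. The following is a minimal sketch of that difference, not the paper's code: the function names (`sgd_indices`, `rr_indices`, `so_indices`, `run`) and the one-dimensional least-squares components f_i(x) = 0.5 * (a[i] * x - b[i])^2 are assumptions chosen for illustration.

```python
import random


def sgd_indices(n, epochs, rng):
    """Plain SGD: every step samples an index uniformly with replacement."""
    return [rng.randrange(n) for _ in range(n * epochs)]


def rr_indices(n, epochs, rng):
    """SGD-RR: draw a fresh permutation at the start of every epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def so_indices(n, epochs, rng):
    """SGD-SO: draw a single permutation and reuse it in every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm * epochs


def run(order, a, b, step, x0=0.0):
    """Take gradient steps on f_i(x) = 0.5 * (a[i] * x - b[i]) ** 2
    (a hypothetical toy objective), visiting components in `order`."""
    x = x0
    for i in order:
        x -= step * a[i] * (a[i] * x - b[i])
    return x
```

Running `run` many times with a constant step size and each of the three index generators, then comparing the empirical variance of the final iterate, is one way to observe the kind of stationary-variance comparison the abstract describes.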

Random Shuffling Beats SGD after Finite Epochs

It is proved that, under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective and T is the total number of iterations.

Why random reshuffling beats stochastic gradient descent

This paper provides various convergence rate results for RR and its variants when the sum function is strongly convex, and shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), RR with iterate averaging and a diminishing stepsize $\alpha_k=\Theta(1/k^s)$ converges to zero.

Random Reshuffling: Simple Analysis with Vast Improvements

The theory for strongly convex objectives tightly matches the known lower bounds for both RR and SO. It substantiates the common practical heuristic of shuffling once or only a few times, and proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once.

How Good is SGD with Random Shuffling?

This paper proves that after $k$ passes over $n$ individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/(nk^3)\right)$, which partially corresponds to recently derived upper bounds.

Unified Optimal Analysis of the (Stochastic) Gradient Method

  • S. Stich
  • Computer Science, Mathematics
  • 2019
This note gives a simple proof for the convergence of stochastic gradient methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption and recovers the exponential convergence rate.

The Complexity of Finding Stationary Points with Stochastic Gradient Descent

It is shown that for nonconvex functions, the feasibility of minimizing gradients with SGD is surprisingly sensitive to the choice of optimality criteria, and that this holds even when the analysis is restricted to convex quadratic functions.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate. For non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm

An improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives is obtained, and it is shown how reweighting the sampling distribution is necessary in order to further improve convergence.

Understanding the Role of Momentum in Stochastic Gradient Methods

The general formulation of QHM is used to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions, and to derive sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

The first non-asymptotic results for stochastic gradient descent without replacement (sgdwor) when applied to general smooth, strongly convex functions are provided, which show that sgdwor converges at a rate of O(1/K^2), while SGD is known to converge at an O(1/K) rate.