Why random reshuffling beats stochastic gradient descent

@article{Grbzbalaban2021WhyRR,
  title={Why random reshuffling beats stochastic gradient descent},
  author={Mert G{\"u}rb{\"u}zbalaban and Asuman E. Ozdaglar and Pablo A. Parrilo},
  journal={Math. Program.},
  year={2021},
  volume={186},
  pages={49--84}
}
We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart, stochastic gradient descent (SGD), characterization of its convergence rate has been a long-standing open question.
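As a minimal illustration of the sampling difference the abstract describes, the sketch below contrasts the two schemes on a generic finite sum. The interface (a list `grads` of component-gradient callables, the step size, and the epoch counts) is an illustrative assumption, not the paper's experimental setup.

```python
import numpy as np

def random_reshuffling(grads, x0, n_epochs, lr, seed=0):
    """RR: one cycle per epoch, each component visited exactly once,
    in a fresh uniformly random order (sampling without replacement)."""
    x = np.asarray(x0, dtype=float).copy()
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(len(grads)):
            x -= lr * grads[i](x)
    return x

def sgd(grads, x0, n_iters, lr, seed=0):
    """With-replacement counterpart: each iteration draws a component
    index uniformly and independently of all previous draws."""
    x = np.asarray(x0, dtype=float).copy()
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        i = rng.integers(len(grads))
        x -= lr * grads[i](x)
    return x
```

For instance, with quadratic components f_i(x) = ½‖x − a_i‖², passing `grads = [lambda x, a=a_i: x - a for a_i in data]` recovers a strongly convex finite sum of the kind the paper studies.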

Random Reshuffling: Simple Analysis with Vast Improvements

TLDR
The theory for strongly convex objectives tightly matches the known lower bounds for both RR and Shuffle-Once (SO), proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once, and thereby substantiates the common practical heuristic of shuffling once or only a few times.
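To make the shuffle-once heuristic concrete, here is a sketch of the SO variant under the same assumed interface as the RR sketch above; the only change is that the permutation is drawn once up front and reused in every epoch.

```python
import numpy as np

def shuffle_once(grads, x0, n_epochs, lr, seed=0):
    """Shuffle-Once (SO): draw a single random permutation up front,
    then sweep the components in that same fixed order every epoch."""
    x = np.asarray(x0, dtype=float).copy()
    order = np.random.default_rng(seed).permutation(len(grads))
    for _ in range(n_epochs):
        for i in order:              # same order reused each cycle
            x -= lr * grads[i](x)
    return x
```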

Random Shuffling Beats SGD after Finite Epochs

TLDR
It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations.

Convergence of Random Reshuffling Under The Kurdyka-Łojasiewicz Inequality

TLDR
Under the well-known Kurdyka-Łojasiewicz (KL) inequality, strong limit-point convergence results for RR with appropriate diminishing step sizes are established: the whole sequence of iterates generated by RR converges to a single stationary point almost surely.

How Good is SGD with Random Shuffling?

TLDR
This paper proves that after $k$ passes over individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2 + 1/(nk^3)\right)$, which partially corresponds to recently derived upper bounds.

Stochastic Learning Under Random Reshuffling With Constant Step-Sizes

TLDR
The analysis establishes analytically that random reshuffling outperforms uniform sampling and derives an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement.

Distributed Random Reshuffling over Networks

TLDR
It is shown that D-RR inherits favorable characteristics of RR for both smooth strongly convex and smooth nonconvex objective functions, and its convergence results match those of centralized RR and outperform the distributed stochastic gradient descent (DSGD) algorithm when run for a relatively large number of epochs.
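The following is one plausible rendering of a decentralized RR step, assuming each node holds an equally sized local shard, a doubly stochastic mixing matrix `W`, and a mix-then-step update; the exact update order and tuning in D-RR may differ, so treat this purely as a structural sketch.

```python
import numpy as np

def d_rr(local_grads, W, x0, n_epochs, lr, seed=0):
    """Sketch: every node runs random reshuffling over its own local
    data while gossip-averaging iterates with its neighbours via W."""
    rng = np.random.default_rng(seed)
    m = len(local_grads)                 # number of nodes
    n = len(local_grads[0])              # samples per node (assumed equal)
    X = np.tile(np.asarray(x0, float), (m, 1))      # one iterate per node
    for _ in range(n_epochs):
        perms = [rng.permutation(n) for _ in range(m)]  # fresh local shuffles
        for t in range(n):
            G = np.stack([local_grads[i][perms[i][t]](X[i]) for i in range(m)])
            X = W @ X - lr * G           # mix with neighbours, then step
    return X.mean(axis=0)
```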

On the Comparison between Cyclic Sampling and Random Reshuffling

TLDR
A norm, defined based on the sampling order, is introduced to measure the distance to the solution; applied to the proximal Finito/MISO algorithm, it identifies an optimal fixed ordering that can beat random reshuffling by a factor of up to log(n)/n in terms of the best-known upper bounds.

Proximal and Federated Random Reshuffling

TLDR
Two new algorithms, Proximal and Federated Random Reshuffling (ProxRR and FedRR), are proposed; they solve composite convex finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of n smooth objectives.
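The structural idea, as the TLDR suggests, is that the proximal operator of the regularizer is applied once per epoch rather than once per iteration. The sketch below assumes that reading and an epoch-aggregated prox step size of n·lr; both are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def prox_rr(grads, prox, x0, n_epochs, lr, seed=0):
    """Sketch of proximal RR for F(x) = (1/n) sum_i f_i(x) + R(x):
    one reshuffled pass over the smooth terms, then one prox step."""
    x = np.asarray(x0, dtype=float).copy()
    n = len(grads)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x -= lr * grads[i](x)        # RR pass over smooth components
        x = prox(x, n * lr)              # prox of R, once per epoch
    return x

def prox_l1(x, step, lam=0.1):
    """Prox of R(x) = lam * ||x||_1: coordinate-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
```

A consequence of this once-per-epoch design is that the prox cost does not scale with n, which is what makes it attractive when the prox is expensive, as in the federated setting.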

On the performance of random reshuffling in stochastic learning

TLDR
The analysis establishes analytically that random reshuffling outperforms independent sampling by showing that the iterate at the end of each run approaches a smaller neighborhood of size O(μ^2) around the minimizer, rather than O(μ).
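A toy experiment in the spirit of this result, on assumed scalar quadratic components f_i(x) = ½(x − a_i)², is sketched below. It only loosely tracks the O(μ^2)-versus-O(μ) mean-square statement: in a single run one should see the final reshuffling error scale roughly like μ and the with-replacement error roughly like √μ.

```python
import numpy as np

def final_error(scheme, mu=0.01, n=50, epochs=2000, seed=0):
    """Run constant-step incremental gradient descent on
    f_i(x) = 0.5*(x - a_i)^2 and report |x - x*| at the end."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)
    x, x_star = 0.0, a.mean()            # minimizer of the average
    for _ in range(epochs):
        idx = rng.permutation(n) if scheme == "rr" else rng.integers(n, size=n)
        for i in idx:
            x -= mu * (x - a[i])         # gradient of f_i at x
    return abs(x - x_star)

print("reshuffling      :", final_error("rr"))
print("with replacement :", final_error("iid"))
```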

A General Analysis of Example-Selection for Stochastic Gradient Descent

TLDR
This paper develops a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings, and proposes two new example-selection approaches using quasi-Monte-Carlo methods.
...

References


Convergence Rate of Incremental Gradient and Newton Methods

TLDR
This paper presents fast convergence results for the incremental gradient and incremental Newton methods under constant and diminishing stepsizes, and shows that to achieve the fastest 1/k rate, the incremental gradient method needs a stepsize that requires tuning to the strong convexity parameter, whereas the incremental Newton method does not.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

TLDR
This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent?

TLDR
The question is whether averaging is needed at all to get optimal rates for strongly convex stochastic gradient descent; the algorithm makes use of an oracle which returns a random vector $\hat{g}$ whose expectation is a subgradient of $F(w)$.

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

TLDR
This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.

Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second-order curvature information leveraged by quasi-Newton methods.

Toward a Noncommutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences

TLDR
Focusing on least-mean-squares optimization, a noncommutative arithmetic-geometric mean inequality is formulated that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling.
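As best it can be restated here, the conjectured inequality compares without-replacement products of positive semidefinite matrices $A_1, \dots, A_n$ against their with-replacement counterparts in operator norm; see the paper for the precise statement and the cases in which it is verified:

\[
\left\| \frac{(n-k)!}{n!} \sum_{\substack{j_1,\dots,j_k\\ \text{pairwise distinct}}} A_{j_1} A_{j_2} \cdots A_{j_k} \right\|
\;\le\;
\left\| \frac{1}{n^k} \sum_{j_1,\dots,j_k=1}^{n} A_{j_1} A_{j_2} \cdots A_{j_k} \right\|
= \left\| \Big( \frac{1}{n} \sum_{j=1}^{n} A_j \Big)^{\!k} \right\|.
\]

A smaller norm for the without-replacement average is what would translate into the faster expected convergence of without-replacement sampling.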

Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms

TLDR
This work considers three ways to pick the example z[t] at each iteration of a stochastic gradient algorithm aimed at minimizing the cost function $\min_\theta C(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(z_i, \theta)$.
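A hedged rendering of the three selection schemes as index sequences (the names and interface below are illustrative, not the paper's):

```python
import numpy as np

def example_order(scheme, m, n_epochs, seed=0):
    """Return the sequence of example indices visited over n_epochs passes:
    'random'  -- i.i.d. uniform draws with replacement,
    'cycle'   -- the same fixed order 0..m-1 repeated every pass,
    'shuffle' -- a fresh random permutation at the start of each pass."""
    rng = np.random.default_rng(seed)
    if scheme == "random":
        return rng.integers(m, size=m * n_epochs)
    if scheme == "cycle":
        return np.tile(np.arange(m), n_epochs)
    if scheme == "shuffle":
        return np.concatenate([rng.permutation(m) for _ in range(n_epochs)])
    raise ValueError(f"unknown scheme: {scheme}")
```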

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

TLDR
It is proved that this method enjoys the same fast convergence rate as stochastic dual coordinate ascent (SDCA) and stochastic average gradient (SAG), but the analysis is significantly simpler and more intuitive.
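For reference, the sketch below shows the standard form of the predictive variance-reduction update (an SVRG-style control variate built from a snapshot and its full gradient); the snapshot policy and epoch lengths here are assumptions.

```python
import numpy as np

def svrg(grads, full_grad, x0, n_epochs, inner_steps, lr, seed=0):
    """Each epoch: fix a snapshot x_tilde and its full gradient mu, then
    take inner steps with the variance-reduced gradient estimate
    g_i(x) - g_i(x_tilde) + mu, whose variance vanishes as x -> x*."""
    x = np.asarray(x0, dtype=float).copy()
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        x_tilde = x.copy()
        mu = full_grad(x_tilde)          # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(len(grads))
            x -= lr * (grads[i](x) - grads[i](x_tilde) + mu)
    return x
```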

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey

TLDR
A unified algorithmic framework is introduced for incremental methods for minimizing a sum $\sum_{i=1}^{m} f_i(x)$ consisting of a large number of convex component functions $f_i$, including the advantages offered by randomization in the selection of components.

Parallel stochastic gradient algorithms for large-scale matrix completion

TLDR
Jellyfish, an algorithm for solving data-processing problems with matrix-valued decision variables regularized to have low rank, is developed and shown to be orders of magnitude more efficient than existing codes.