• Corpus ID: 215827700

# On Tight Convergence Rates of Without-replacement SGD

@article{Ahn2020OnTC,
title={On Tight Convergence Rates of Without-replacement SGD},
author={Kwangjun Ahn and Suvrit Sra},
journal={ArXiv},
year={2020},
volume={abs/2004.08657}
}
• Published 18 April 2020
• Computer Science
• ArXiv
For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD with replacement. Denoting by $n$ the number of components in the cost and by $K$ the number of epochs of the algorithm, several recent works have shown convergence rates of without-replacement SGD that have better dependency on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for with-replacement SGD. However, there are two main limitations shared among those works: the rates have extra poly-logarithmic factors…
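The distinction the abstract draws can be sketched in code. Below is a minimal, illustrative comparison of with-replacement SGD and without-replacement SGD (random reshuffling) on a toy least-squares finite sum; the problem instance, hyperparameters, and function names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum least-squares problem (illustrative, not from the paper):
#   f(w) = (1/n) * sum_i (x_i . w - y_i)^2 / 2
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true                             # noiseless targets

def grad_i(w, i):
    """Gradient of the i-th component (x_i . w - y_i)^2 / 2."""
    return (X[i] @ w - y[i]) * X[i]

def sgd_with_replacement(w, K, lr):
    """Baseline SGD: each step samples a component i.i.d. uniformly."""
    for _ in range(K * n):                 # K "epochs" worth of steps
        i = rng.integers(n)
        w = w - lr * grad_i(w, i)
    return w

def sgd_without_replacement(w, K, lr):
    """Random reshuffling: a fresh permutation each epoch, each
    component visited exactly once per epoch."""
    for _ in range(K):
        for i in rng.permutation(n):
            w = w - lr * grad_i(w, i)
    return w

w0 = np.zeros(d)
err_wr = np.linalg.norm(sgd_with_replacement(w0, 20, 0.05) - w_true)
err_wor = np.linalg.norm(sgd_without_replacement(w0, 20, 0.05) - w_true)
```

Here $n$ and $K$ play the same roles as in the abstract: `n` components, `K` passes; the without-replacement variant differs only in that each epoch is a permutation of the components rather than `n` i.i.d. draws.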
## 4 Citations

• Computer Science
NeurIPS
• 2020
Stochastic Variance Reduction via Accelerated Dual Averaging improves complexity of the best known methods without use of any additional strategy such as optimal black-box reduction, and it leads to a unified convergence analysis and simplified algorithm for both the nonstrongly convex and strongly convex settings.
• Computer Science
NeurIPS
• 2020
The theory for strongly-convex objectives tightly matches the known lower bounds for both RR and SO. It substantiates the common practical heuristic of shuffling once or only a few times, and proves fast convergence of the Shuffle-Once algorithm, which shuffles the data only once.
• Computer Science
ArXiv
• 2022
This work presents a comprehensive theoretical analysis of FedShuffle and shows that it does not suffer from the objective function mismatch that is present in FL methods that assume homogeneous updates in heterogeneous FL setups, such as FedAvg (McMahan et al., 2017).
• Computer Science
ArXiv
• 2023
This research proposes a new technique and designs a novel regularized client participation scheme that reduces the variance caused by client sampling; combined with the popular FedAvg algorithm, it yields superior rates under standard assumptions.

## References

SHOWING 1-10 OF 14 REFERENCES

• Computer Science, Mathematics
ICML
• 2020
It is shown that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ is offered for strongly convex functions that are sums of smooth functions.
• Computer Science
COLT 2019
• 2019
This paper proves that after $k$ passes over individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2 + 1/(nk^3)\right)$, which partially corresponds to recently derived upper bounds.
• Computer Science
ICML
• 2019
The first non-asymptotic results for stochastic gradient descent without replacement, when applied to general smooth, strongly-convex functions, are provided, showing that SGD without replacement converges at a rate of $O(1/K^2)$ while with-replacement SGD is known to converge at an $O(1/K)$ rate.
• Computer Science
ICML
• 2019
It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate $O(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the total number of iterations.
This paper provides competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data, yielding a nearly-optimal algorithm for regularized least squares under broad parameter regimes.
• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2021
This paper provides a unified convergence analysis for a class of shuffling-type gradient methods for solving a well-known finite-sum minimization problem commonly used in machine learning and introduces new non-asymptotic and asymptotic convergence rates.
• Mathematics, Computer Science
• 2001
An incremental approach to minimizing a convex function that consists of the sum of a large number of component functions is considered, which has been very successful in solving large differentiable least squares problems, such as those arising in the training of neural networks.
• Computer Science, Mathematics
Math. Program.
• 2021
This paper provides various convergence rate results for RR and variants when the sum function is strongly convex, and shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), the error of RR with iterate averaging and a diminishing stepsize $\alpha_k = \Theta(1/k^s)$ converges to zero.
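The variant described in this entry, random reshuffling with iterate averaging and a diminishing stepsize $\alpha_k = \Theta(1/k^s)$, can be sketched as follows. This is a toy scalar least-squares instance with the stepsize decayed per epoch; the constants, the exponent $s$, and all names are assumptions for illustration, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy strongly convex finite sum (illustrative):
#   f(w) = (1/n) * sum_i (a_i * w - b_i)^2 / 2,  scalar w
n = 100
a = rng.uniform(0.5, 1.5, n)
b = rng.uniform(-1.0, 1.0, n)
w_star = (a @ b) / (a @ a)                 # exact minimizer of the sum

def rr_with_averaging(K, s=0.6):
    """Random reshuffling with a running average of the iterates
    and a per-epoch diminishing stepsize alpha_k = Theta(1/k^s)."""
    w, w_avg, t = 0.0, 0.0, 0
    for k in range(1, K + 1):
        lr = 0.5 / k ** s                  # diminishing stepsize
        for i in rng.permutation(n):       # one pass without replacement
            w -= lr * (a[i] * w - b[i]) * a[i]
            t += 1
            w_avg += (w - w_avg) / t       # incremental iterate average
    return w_avg

w_avg = rr_with_averaging(K=50)
```

The averaging step smooths out the oscillation of the last iterate around the minimizer, which is the mechanism behind the convergence-to-zero result summarized above.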
This work considers three ways to pick the example $z_t$ at each iteration of a stochastic gradient algorithm for minimizing the cost function $\min_\theta C(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(z_i, \theta)$.
Let $M(x)$ denote the expected value at level $x$ of the response to a certain experiment. $M(x)$ is assumed to be a monotone function of $x$ but is unknown to the experimenter, and it is desired to find the…