Corpus ID: 215827700

On Tight Convergence Rates of Without-replacement SGD

Kwangjun Ahn and Suvrit Sra
For solving finite-sum optimization problems, SGD with without-replacement sampling is empirically observed to outperform SGD with replacement. Denoting by $n$ the number of components in the cost and by $K$ the number of epochs of the algorithm, several recent works have shown convergence rates for without-replacement SGD with better dependence on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for SGD. However, those works share two main limitations: the rates have extra poly-logarithmic factors…
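The without-replacement scheme the abstract refers to has a simple epoch structure: each pass visits every component gradient exactly once, in a freshly shuffled order (random reshuffling). A minimal sketch on a made-up quadratic finite sum; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def sgd_without_replacement(grads, x0, lr, epochs, rng):
    """Run `epochs` passes of SGD with without-replacement sampling
    (random reshuffling): each pass uses every component gradient
    exactly once, in a freshly shuffled order."""
    x = x0
    n = len(grads)
    for _ in range(epochs):
        order = rng.permutation(n)   # fresh shuffle every epoch
        for i in order:
            x = x - lr * grads[i](x)
    return x

# Toy finite sum: f(x) = (1/n) * sum_i (x - a_i)^2 / 2, minimized at mean(a).
a = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, ai=ai: x - ai for ai in a]
rng = np.random.default_rng(0)
x_star = sgd_without_replacement(grads, x0=0.0, lr=0.1, epochs=200, rng=rng)
```

On this toy problem the iterate settles near the minimizer `a.mean()`; the sketch only illustrates the epoch structure, not the rates analyzed in the paper.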

Tables from this paper

Stochastic Variance Reduction via Accelerated Dual Averaging for Finite-Sum Optimization

Stochastic Variance Reduction via Accelerated Dual Averaging improves the complexity of the best known methods without using any additional strategy such as optimal black-box reduction, and it leads to a unified convergence analysis and a simplified algorithm for both the non-strongly convex and strongly convex settings.

Random Reshuffling: Simple Analysis with Vast Improvements

The theory for strongly convex objectives tightly matches the known lower bounds for both RR and SO, substantiates the common practical heuristic of shuffling once or only a few times, and proves fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once.

FedShuffle: Recipes for Better Use of Local Work in Federated Learning

This work presents a comprehensive theoretical analysis of FedShuffle and shows that it does not suffer from the objective function mismatch that is present in FL methods that assume homogeneous updates in heterogeneous FL setups, such as FedAvg (McMahan et al., 2017).

Federated Learning with Regularized Client Participation

This research proposes a new technique and designs a novel regularized client participation scheme that reduces the variance caused by client sampling; combined with the popular FedAvg algorithm, it yields superior rates under standard assumptions.

Closing the convergence gap of SGD without replacement

It is shown that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ is offered for strongly convex functions that are sums of smooth functions.

How Good is SGD with Random Shuffling?

This paper proves that after $k$ passes over the individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/(nk^3)\right)$, which partially corresponds to recently derived upper bounds.

SGD without Replacement: Sharper Rates for General Smooth Convex Functions

The first non-asymptotic results are provided for stochastic gradient descent without replacement applied to general smooth, strongly convex functions, showing that it converges at a rate of $O(1/K^2)$ while SGD is known to converge at an $O(1/K)$ rate.

Random Shuffling Beats SGD after Finite Epochs

It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate $O(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the total number of iterations.

Without-Replacement Sampling for Stochastic Gradient Methods

This paper provides competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data, yielding a nearly-optimal algorithm for regularized least squares under broad parameter regimes.

A Unified Convergence Analysis for Shuffling-Type Gradient Methods

This paper provides a unified convergence analysis for a class of shuffling-type gradient methods for solving a well-known finite-sum minimization problem commonly used in machine learning and introduces new non-asymptotic and asymptotic convergence rates.

Convergence Rate of Incremental Subgradient Algorithms

An incremental approach to minimizing a convex function that consists of the sum of a large number of component functions is considered, which has been very successful in solving large differentiable least squares problems, such as those arising in the training of neural networks.

Why random reshuffling beats stochastic gradient descent

This paper provides various convergence rate results for RR and its variants when the sum function is strongly convex, and shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), the error of RR with iterate averaging and a diminishing stepsize $\alpha_k = \Theta(1/k^s)$ converges to zero.

Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms

This work considers three ways to pick the example $z_t$ at each iteration of a stochastic gradient algorithm for minimizing the cost function $\min_\theta C(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(z_i, \theta)$.
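The three selection schemes studied in this line of work (i.i.d. sampling with replacement, cycling through one fixed random order, and reshuffling before every pass) can be sketched as index generators; the function name and scheme labels below are illustrative, not the paper's terminology:

```python
import numpy as np

def pick_indices(scheme, n, num_iters, rng):
    """Yield the component index used at each iteration under three
    classic sampling schemes (descriptive labels, not the paper's):
      'random'  - draw indices i.i.d. uniformly, with replacement
      'cycle'   - fix one random order, repeat it every pass
      'shuffle' - draw a fresh random order before every pass
    """
    order = rng.permutation(n)
    for t in range(num_iters):
        if scheme == "random":
            yield int(rng.integers(n))
        elif scheme == "cycle":
            yield int(order[t % n])
        elif scheme == "shuffle":
            if t % n == 0 and t > 0:
                order = rng.permutation(n)  # reshuffle at each pass boundary
            yield int(order[t % n])

rng = np.random.default_rng(0)
idx = list(pick_indices("shuffle", n=5, num_iters=10, rng=rng))
```

Under 'cycle' and 'shuffle', every block of $n$ consecutive indices is a permutation of $\{0, \dots, n-1\}$; under 'random' an index may repeat or be skipped within a pass.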

A Stochastic Approximation Method

Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the…