Corpus ID: 199064321

How Good is SGD with Random Shuffling?

@article{Safran2019HowGI,
  title={How Good is SGD with Random Shuffling?},
  author={Itay Safran and Ohad Shamir},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.00045}
}
We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remain unclear. In this …
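To make the two sampling schemes concrete, the following is a minimal sketch on a toy least-squares finite sum. The problem instance, step size, and function names are illustrative assumptions, not taken from the paper.

import numpy as np

def make_problem(n=100, d=10, seed=0):
    # Toy finite sum: F(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, A, b, i):
    # Gradient of the i-th component f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(A, b, epochs=50, lr=0.01, seed=1):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)              # i.i.d. index, sampled with replacement
            x -= lr * grad_i(x, A, b, i)
    return x

def sgd_random_reshuffling(A, b, epochs=50, lr=0.01, seed=1):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):         # fresh random permutation each epoch
            x -= lr * grad_i(x, A, b, i)
    return x

if __name__ == "__main__":
    A, b = make_problem()
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    for name, solver in [("with replacement", sgd_with_replacement),
                         ("random reshuffling", sgd_random_reshuffling)]:
        x = solver(A, b)
        print(f"{name:>20}: ||x - x*|| = {np.linalg.norm(x - x_star):.3e}")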
Citations

Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems
It is proved that when the condition number is taken into account, without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.
Permutation-Based SGD: Is Random Optimal?
The results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.
On Tight Convergence Rates of Without-replacement SGD
This work analyzes step sizes that vary across the epochs of without-replacement SGD and shows that the rates hold after $\kappa^c\log(nK)$ epochs for some $c>0$.
Optimal Rates for Random Order Online Optimization
This analysis relies on novel connections between algorithmic stability and generalization for sampling without replacement, analogous to those studied in the with-replacement i.i.d. setting, as well as on a refined average stability analysis of stochastic gradient descent.
Random Reshuffling with Variance Reduction: New Analysis and Better Rates
This work provides the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems and obtains the first sublinear rate for general convex problems.
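As a rough illustration of what "SVRG under Random Reshuffling" means, the sketch below replaces SVRG's with-replacement inner loop by a permutation pass; the reference-point update and step size are standard SVRG assumptions here, not necessarily the choices analyzed in the cited paper.

import numpy as np

def svrg_random_reshuffling(grad_i, n, d, outer_iters=20, lr=0.05, seed=0):
    # grad_i(x, i) should return the gradient of the i-th component at x.
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(outer_iters):
        x_ref = x.copy()
        # Full gradient at the reference point (the usual SVRG snapshot).
        full_grad = sum(grad_i(x_ref, i) for i in range(n)) / n
        for i in rng.permutation(n):      # inner loop visits each component once, in random order
            g = grad_i(x, i) - grad_i(x_ref, i) + full_grad   # variance-reduced estimate
            x = x - lr * g
    return x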
Shuffling Gradient-Based Methods with Momentum
We combine two advanced ideas widely used in optimization for machine learning, the shuffling strategy and the momentum technique, to develop a novel shuffling gradient-based method with momentum …
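For context, a generic way to combine an epoch-wise shuffling pass with heavy-ball momentum is sketched below; this is an illustrative combination under stated assumptions, not necessarily the exact update rule proposed in the cited paper.

import numpy as np

def shuffled_sgd_momentum(grad_i, n, d, epochs=50, lr=0.01, beta=0.9, seed=0):
    # grad_i(x, i) should return the gradient of the i-th component at x.
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    v = np.zeros(d)                      # momentum buffer
    for _ in range(epochs):
        for i in rng.permutation(n):     # one without-replacement pass per epoch
            v = beta * v + grad_i(x, i)  # heavy-ball style accumulation
            x = x - lr * v
    return x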
Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization
This work revisits Wasserstein DRSL through the lens of min-max optimization and derives scalable, efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches.
On the Comparison between Cyclic Sampling and Random Reshuffling
A norm defined by the sampling order is introduced to measure the distance to the solution; applied to the proximal Finito/MISO algorithm, it identifies the optimal fixed ordering, which can beat random reshuffling by a factor of up to log(n)/n in terms of the best-known upper bounds.
Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates
This work presents two new, remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions, and establishes local linear-quadratic convergence results.
Random Reshuffling is Not Always Better
This work gives a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture related to the performance of random reshuffling in learning algorithms, and gives an example of a learning task and algorithm for which with-replacement random sampling outperforms random reshuffling.

References

Showing 1-10 of 20 references
Random Shuffling Beats SGD after Finite Epochs
It is proved that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective and T is the total number of iterations.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; for non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Why random reshuffling beats stochastic gradient descent
The convergence rate of the random reshuffling (RR) method is analyzed, and it is shown that when the component functions are quadratics or smooth and the sum function is strongly convex, RR with iterate averaging and a diminishing step size converges at rate $\Theta(1/k^{2s})$ with probability one in the suboptimality of the objective value, thus improving upon the $\Omega(1/k)$ rate of SGD.
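The ingredients named in this summary (a without-replacement pass per epoch, a diminishing step size, and iterate averaging) can be sketched as follows; the specific schedule and the uniform average are illustrative assumptions rather than the paper's exact choices.

import numpy as np

def rr_with_averaging(grad_i, n, d, epochs=50, c=1.0, seed=0):
    # grad_i(x, i) should return the gradient of the i-th component at x.
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    x_avg = np.zeros(d)
    for k in range(epochs):
        lr = c / (k + 1)                   # diminishing step size across epochs
        for i in rng.permutation(n):       # random reshuffling pass
            x = x - lr * grad_i(x, i)
        x_avg = (k * x_avg + x) / (k + 1)  # running average of per-epoch iterates
    return x_avg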
SGD without Replacement: Sharper Rates for General Smooth Convex Functions
This paper provides the first non-asymptotic results for stochastic gradient descent without replacement applied to general smooth, strongly convex functions, showing that it converges at a rate of O(1/K^2), while with-replacement SGD is known to converge at an O(1/K) rate.
Stochastic Learning Under Random Reshuffling With Constant Step-Sizes
The analysis establishes analytically that random reshuffling outperforms uniform sampling, and derives an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement.
Without-Replacement Sampling for Stochastic Gradient Methods
O. Shamir. NIPS, 2016.
This paper provides competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data, and yields a nearly-optimal algorithm for regularized least squares under broad parameter regimes.
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized …
Robust Stochastic Approximation Approach to Stochastic Programming
It is intended to demonstrate that a properly modified stochastic approximation (SA) approach can be competitive with, and even significantly outperform, the sample average approximation (SAA) method for a certain class of convex stochastic problems.
Convergence Rate of Incremental Gradient and Newton Methods
The incremental gradient method is a prominent algorithm for minimizing a finite sum of smooth convex functions, used in many contexts including large-scale data processing applications and …
Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms
Given a finite set of $m$ examples $z_1, \ldots, z_m$ and a strictly convex, differentiable loss function $\ell(z, \theta)$ defined on a parameter vector $\theta \in \mathbb{R}^d$, we are interested in minimizing the cost …
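The scraped abstract breaks off at "the cost"; the standard finite-sum objective this setup points to, written here as an assumption rather than a quotation, is

\[
  % Empirical cost implied by the setup (stated as an assumption; the scraped text is cut off)
  C(\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m} \ell(z_i, \theta), \qquad \theta \in \mathbb{R}^{d}.
\]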