Corpus ID: 10627340

Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions

@article{Dfossez2014ConstantSS,
  title={Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions},
  author={Alexandre D{\'e}fossez and Francis R. Bach},
  journal={ArXiv},
  year={2014},
  volume={abs/1412.0156}
}
We consider the least-squares regression problem and provide a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent (a.k.a. least-mean-squares). In the strongly-convex case, we provide an asymptotic expansion up to explicit exponentially decaying terms. Our analysis leads to new insights into stochastic approximation algorithms: (a) it gives a tighter bound on the allowed step-size; (b) the generalization error may be divided into a variance…
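For readers who prefer code to notation, the following is a minimal NumPy sketch of averaged constant-step-size SGD for least squares, the algorithm the abstract analyzes; the synthetic data, the step size gamma, and the function name are illustrative assumptions, not details from the paper.

```python
import numpy as np

def averaged_lms(X, y, gamma):
    """Averaged constant-step-size SGD for least squares (least-mean-squares).

    One pass over (X, y); returns the Polyak-Ruppert average of the iterates.
    """
    n, d = X.shape
    theta = np.zeros(d)        # current iterate
    theta_bar = np.zeros(d)    # running average of iterates
    for k in range(n):
        x_k, y_k = X[k], y[k]
        # Stochastic gradient of 0.5 * (x_k . theta - y_k)^2
        grad = (x_k @ theta - y_k) * x_k
        theta -= gamma * grad
        theta_bar += (theta - theta_bar) / (k + 1)
    return theta_bar

# Illustrative synthetic problem (not from the paper)
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)
theta_hat = averaged_lms(X, y, gamma=0.01)
print(np.linalg.norm(theta_hat - theta_star))
```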
Citations

Competing with the Empirical Risk Minimizer in a Single Pass
TLDR
This work provides a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties: the algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample…
From Averaging to Acceleration, There is Only a Step-size
We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant-parameter second-order difference equations…
Iterate averaging as regularization for stochastic gradient descent
TLDR
A variant of the classic Polyak-Ruppert averaging scheme, broadly used in stochastic gradient methods, is proposed: a weighted average with geometrically decaying weights, studied in the context of linear least-squares regression.
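To make the geometric-weighting idea concrete, here is a minimal sketch of averaging SGD iterates with geometrically decaying weights; the decay factor beta and the exact weighting are illustrative assumptions rather than the paper's scheme.

```python
import numpy as np

def geometric_average(iterates, beta=0.9):
    """Weighted average of SGD iterates with geometrically decaying weights.

    More recent iterates receive larger weight; beta controls the decay.
    (Illustrative weighting, not the exact scheme of the paper.)
    """
    iterates = np.asarray(iterates)                # shape (T, d)
    T = len(iterates)
    weights = beta ** np.arange(T - 1, -1, -1)     # oldest: beta^(T-1), newest: 1
    weights /= weights.sum()
    return weights @ iterates
```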
Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning
TLDR
This work develops a novel recipe for their finite-sample analysis, provides a concentration bound (the first such result for a two-timescale SA), and introduces a new projection scheme in which the time between successive projections increases exponentially.
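As a rough illustration of the two-timescale structure (a fast auxiliary update coupled with a slow main update), here is a generic sketch; the step-size schedules and the update functions h and g are assumptions for illustration, not the specific setting analyzed in the paper.

```python
import numpy as np

def two_timescale_sa(h, g, theta0, w0, n_steps, a=0.1, b=0.5):
    """Generic two-timescale stochastic approximation.

    The fast variable w uses step sizes b / (k+1)^0.6 and the slow variable
    theta uses a / (k+1); h and g return (possibly noisy) update directions.
    The step-size exponents are illustrative choices.
    """
    theta, w = np.array(theta0, float), np.array(w0, float)
    for k in range(n_steps):
        alpha_k = a / (k + 1)
        beta_k = b / (k + 1) ** 0.6
        w = w + beta_k * g(theta, w)           # fast timescale
        theta = theta + alpha_k * h(theta, w)  # slow timescale
    return theta, w
```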
New Insights and Perspectives on the Natural Gradient Method
  • James Martens
  • Computer Science, Mathematics
  • J. Mach. Learn. Res.
  • 2020
TLDR
This paper critically analyzes this method and its properties, and shows how it can be viewed as a type of approximate second-order optimization method, where the Fisher information matrix can be viewed as an approximation of the Hessian.
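To illustrate the preconditioning viewpoint described above, here is a minimal sketch of a damped natural-gradient step; the damping constant, the learning rate, and the way the Fisher matrix is supplied are illustrative assumptions.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """One natural-gradient step: precondition the gradient by the
    (damped) inverse Fisher information matrix.

    fisher is an estimate of the Fisher information at theta; damping keeps
    the linear solve well-conditioned. lr and damping values are illustrative.
    """
    d = len(theta)
    direction = np.linalg.solve(fisher + damping * np.eye(d), grad)
    return theta - lr * direction
```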
Gradient Diversity: a Key Ingredient for Scalable Distributed Learning
TLDR
It is proved that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one-sample) SGD.
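A small sketch of how a gradient-diversity ratio can be computed from per-example gradients; the exact normalization used in these papers may differ, so treat this as an illustrative definition.

```python
import numpy as np

def gradient_diversity(grads):
    """Ratio of the sum of squared per-example gradient norms to the squared
    norm of their sum (normalization is an illustrative choice, not
    necessarily the exact definition used in the cited papers).
    """
    grads = np.asarray(grads)                         # shape (n, d)
    sum_sq_norms = np.sum(np.linalg.norm(grads, axis=1) ** 2)
    norm_of_sum_sq = np.linalg.norm(grads.sum(axis=0)) ** 2
    return sum_sq_norms / norm_of_sum_sq
```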
Gradient Diversity Empowers Distributed Learning
TLDR
It is proved that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one-sample) SGD.
Gradient Diversity Empowers Distributed Learning: Convergence and Stability of Mini-batch SGD
TLDR
It is proved that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one-sample) SGD.
Exponential convergence of testing error for stochastic gradient methods
TLDR
It is shown that while the excess testing loss converges slowly to zero as the number of observations goes to infinity, the testing error converges exponentially fast if low-noise conditions are assumed.
Solving Empirical Risk Minimization in the Current Matrix Multiplication Time
TLDR
An algorithm is proposed, together with an efficient data structure that maintains the central path of interior-point methods even when the weight-update vector is dense, generalizing the very recent result of solving linear programs in the current matrix multiplication time to a broader class of problems.

References

Showing 1-10 of 20 references
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
TLDR
This work provides a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent and a simple modification in which iterates are averaged. The analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or to the setting of the proportionality constant.
Statistical analysis of stochastic gradient methods for generalized linear models
TLDR
This work develops a computationally efficient algorithm to implement implicit SGD learning of GLMs and obtains exact formulas for the bias and variance of both updates, leading to important observations on their comparative statistical properties.
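For the special case of linear regression with squared loss, the implicit SGD update mentioned above has a closed form; the sketch below assumes that case, with a constant step size chosen only for illustration.

```python
import numpy as np

def implicit_sgd_linear(X, y, gamma=0.1):
    """Implicit SGD for linear regression with squared loss.

    Each update solves theta_new = theta + gamma*(y_k - x_k.theta_new)*x_k
    for theta_new, which has a closed form for this loss. The constant step
    size is an illustrative choice.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(n):
        x_k, y_k = X[k], y[k]
        shrink = gamma / (1.0 + gamma * (x_k @ x_k))
        theta = theta + shrink * (y_k - x_k @ theta) * x_k
    return theta
```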
Stochastic Optimization with Importance Sampling
TLDR
Stochastic optimization with importance sampling is studied; it improves the convergence rate by reducing the stochastic variance, and under suitable conditions the convergence rates of the proposed importance-sampling methods can be significantly improved, both for prox-SGD and for prox-SDCA.
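A minimal sketch of SGD with importance sampling for least squares: examples are drawn from a non-uniform proposal and the gradients are reweighted to remain unbiased. Sampling proportionally to squared row norms is an illustrative choice, not necessarily the distribution advocated in the paper.

```python
import numpy as np

def sgd_importance_sampling(X, y, n_steps, gamma=0.01, rng=None):
    """SGD for least squares with non-uniform sampling.

    Examples are drawn with probability proportional to ||x_i||^2 and each
    gradient is reweighted by 1/(n p_i) to stay unbiased. The proposal
    (squared row norms) is an illustrative importance distribution.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    p = np.linalg.norm(X, axis=1) ** 2
    p /= p.sum()
    theta = np.zeros(d)
    for _ in range(n_steps):
        i = rng.choice(n, p=p)
        grad = (X[i] @ theta - y[i]) * X[i]
        theta -= gamma * grad / (n * p[i])
    return theta
```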
Active learning algorithm using the maximum weighted log-likelihood estimator
We study the problem of constructing designs for regression problems. Our aim is to estimate the mean value of the response variable. The distribution of the independent variable is…
Analysis of the normalized LMS algorithm with Gaussian inputs
  • N. Bershad
  • Mathematics, Computer Science
  • IEEE Trans. Acoust. Speech Signal Process.
  • 1986
TLDR
The transient mean and second-moment behavior of the normalized LMS (NLMS) algorithm are evaluated, taking into account the explicit statistical dependence of μ upon the input data.
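For concreteness, a minimal sketch of the normalized LMS filter update, in which the step size is divided by the energy of the current input window; the filter order, mu, and the regularizer eps are illustrative choices.

```python
import numpy as np

def nlms(x, d, mu=0.5, eps=1e-8, order=8):
    """Normalized LMS adaptive filter.

    x: input signal, d: desired signal. The step size is normalized by the
    energy of the current input window. Filter order and mu are illustrative.
    """
    w = np.zeros(order)
    y_hat = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]                 # most recent samples first
        y_hat[n] = w @ u
        e = d[n] - y_hat[n]
        w = w + (mu / (eps + u @ u)) * e * u     # normalized update
    return w, y_hat
```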
Convergence Rate of Incremental Subgradient Algorithms
We consider a class of subgradient methods for minimizing a convex function that consists of the sum of a large number of component functions. This type of minimization arises in a dual context from…
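A minimal sketch of a cyclic incremental subgradient method for a sum of component functions; the constant step size and cyclic processing order are illustrative choices rather than the specific rules analyzed in the paper.

```python
import numpy as np

def incremental_subgradient(subgrads, x0, n_cycles, step=0.01):
    """Incremental subgradient method for min_x sum_i f_i(x).

    subgrads is a list of functions, each returning a subgradient of one
    component f_i at x; components are processed cyclically with a constant
    step size (illustrative; diminishing step sizes are also common).
    """
    x = np.array(x0, float)
    for _ in range(n_cycles):
        for g_i in subgrads:
            x = x - step * g_i(x)
    return x
```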
The Tradeoffs of Large Scale Learning
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of…
Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems
  • Y. Nesterov
  • Mathematics, Computer Science
  • SIAM J. Optim.
  • 2012
TLDR
Surprisingly enough, for certain classes of objective functions, the complexity bounds obtained for the proposed huge-scale optimization methods are better than the standard worst-case bounds for deterministic algorithms.
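As a small illustration of the coordinate-descent structure on a simple problem, here is a sketch of random coordinate descent for a positive-definite quadratic; the uniform coordinate sampling and the quadratic objective are illustrative assumptions, not the general setting of the paper.

```python
import numpy as np

def random_coordinate_descent(A, b, n_steps, rng=None):
    """Random coordinate descent for 0.5*x'Ax - b'x with A positive definite.

    Each step picks one coordinate and takes an exact minimizing step along
    it, using the diagonal entry as the coordinate-wise curvature. Uniform
    coordinate sampling is an illustrative choice.
    """
    rng = rng or np.random.default_rng(0)
    d = len(b)
    x = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(d)
        g_i = A[i] @ x - b[i]      # partial derivative along coordinate i
        x[i] -= g_i / A[i, i]      # exact coordinate minimization
    return x
```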
Towards good practice in large-scale learning for image classification
TLDR
It is shown that, for one-vs-rest classification, learning the optimal degree of imbalance between positive and negative samples through cross-validation can have a significant impact, and that early stopping can be used as an effective regularization strategy when training with stochastic gradient algorithms.