Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses
@inproceedings{MarteauFerey2019GloballyCN,
  title     = {Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses},
  author    = {Ulysse Marteau-Ferey and Francis R. Bach and Alessandro Rudi},
  booktitle = {Neural Information Processing Systems},
  year      = {2019}
}
In this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is globally convergent and that this convergence is linear, with a constant factor that scales only logarithmically with the condition number. In the parametric…
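As a rough illustration of the kind of scheme the abstract describes, the sketch below applies a few damped Newton steps to l2-regularized logistic regression while geometrically decreasing the regularization parameter, warm-starting each level at the previous solution. This is a minimal sketch under simplifying assumptions: the function name, the fixed number of inner steps, and the ad-hoc damping rule are illustrative choices, not the authors' algorithm.

```python
import numpy as np

def logistic_newton_path(X, y, lam_start=1.0, lam_target=1e-6, decay=0.5,
                         newton_steps=5):
    """Minimal sketch of a decreasing-regularization Newton scheme for
    l2-regularized logistic regression (illustrative only, not the paper's
    exact algorithm; the damping rule and schedule are assumptions)."""
    n, d = X.shape
    w = np.zeros(d)                        # warm start carried along the path
    lam = lam_start
    while lam >= lam_target:
        for _ in range(newton_steps):      # a few damped Newton steps per level
            p = 1.0 / (1.0 + np.exp(-X @ w))             # sigmoid predictions
            grad = X.T @ (p - y) / n + lam * w           # gradient of regularized loss
            H = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)  # regularized Hessian
            step = np.linalg.solve(H, grad)
            w -= step / (1.0 + np.linalg.norm(step))     # crude damping of the Newton step
        lam *= decay                       # shrink the regularization geometrically
    return w
```

For example, `w = logistic_newton_path(X, y)` with an (n, d) design matrix `X` and 0/1 labels `y` traces the regularization path down to an (assumed) target value, reusing each solution as the starting point for the next level; this warm-starting is the intuition behind the logarithmic dependence on the condition number mentioned in the abstract.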
22 Citations
Generalized self-concordant analysis of Frank–Wolfe algorithms
- Computer Science, Mathematics; Mathematical Programming
- 2022
This paper closes the apparent gap in the literature by developing provably convergent Frank–Wolfe algorithms with standard O(1/k) convergence rate guarantees, and shows how these sublinearly convergent methods can be accelerated to yield linearly convergent projection-free methods.
Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation
- Computer Science, Mathematics; AISTATS
- 2020
Stochastic second-order methods for minimizing smooth and strongly convex functions under an interpolation condition satisfied by over-parameterized models are considered, and it is shown that the regularized subsampled Newton method (R-SSN) achieves global linear convergence with an adaptive step-size and a constant batch-size.
Regularized Newton Method with Global O(1/k²) Convergence
- Mathematics, Computer Science; ArXiv
- 2021
A Newton-type method that converges fast from any initialization and for arbitrary convex objectives with Lipschitz Hessians is presented, and it is proved that locally the method converges superlinearly when the objective is strongly convex.
Unifying Width-Reduced Methods for Quasi-Self-Concordant Optimization
- Computer Science; NeurIPS
- 2021
This work presents the first unified width-reduction method for carefully handling quasi-self-concordant losses and directly achieves m^{1/3}-type rates in the constrained setting without the need for any explicit acceleration schemes, thus naturally complementing recent work based on a ball-oracle approach.
Beyond Tikhonov: Faster Learning with Self-Concordant Losses via Iterative Regularization
- Mathematics; NeurIPS
- 2021
This paper shows that fast and optimal rates can be achieved for GSC losses by using the iterated Tikhonov regularization scheme, which is intrinsically related to the proximal point method in optimization and overcomes the limitations of classical Tikhonov regularization.
Asynchronous Parallel Stochastic Quasi-Newton Methods
- Computer Science; Parallel Comput.
- 2021
A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids
- Mathematics, Computer Science; The Annals of Statistics
- 2022
A sieve stochastic gradient descent estimator (Sieve-SGD) is proposed for the case where the hypothesis space is a Sobolev ellipsoid, and it is shown that Sieve-SGD has rate-optimal mean squared error (MSE) under a set of simple and direct conditions.
Distributed Saddle-Point Problems Under Similarity
- Computer Science; ArXiv
- 2021
This work studies solution methods for (strongly-)convex-(strongly-)concave Saddle-Point Problems (SPPs) over two types of networks, master/workers architectures and mesh networks, and proposes algorithms matching the lower bounds over either type of network (up to log factors).
Kernel methods through the roof: handling billions of points efficiently
- Computer Science; NeurIPS
- 2020
This work designs a preconditioned gradient solver for kernel methods that exploits both GPU acceleration and parallelization across multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization.
Learning new physics efficiently with nonparametric methods
- Computer Science; The European Physical Journal C
- 2022
This work presents a machine learning approach for model-independent new-physics searches that has dramatic advantages over neural network implementations in terms of training time and computational resources, while maintaining comparable performance.
References
Showing 1-10 of 41 references
Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence
- Computer Science, Mathematics; SIAM J. Optim.
- 2017
A randomized second-order optimization method known as the Newton Sketch, based on performing an approximate Newton step using a randomly projected or sub-sampled Hessian, is proposed; it has super-linear convergence with exponentially high probability, and convergence and complexity guarantees that are independent of condition numbers and related problem-dependent quantities.
Convergence rates of sub-sampled Newton methods
- Computer Science; NIPS
- 2015
This paper uses sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses a convergence rate comparable to Newton's method, yet has a much smaller per-iteration cost.
Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
- Mathematics, Computer Science; J. Mach. Learn. Res.
- 2014
After N iterations, with a constant step-size proportional to 1/(R²√N), where N is the number of observations and R is the maximum norm of the observations, the convergence rate is always of order O(1/√N), and improves to O(R²/(µN)), which shows that averaged stochastic gradient is adaptive to the unknown local strong convexity of the objective function.
An optimal randomized incremental gradient method
- Computer Science; Math. Program.
- 2018
It is shown that the total number of gradient evaluations performed by RPDG can be several times smaller, both in expectation and with high probability, than those performed by deterministic optimal first-order methods under favorable situations.
Sub-sampled Newton methods
- Computer Science, Mathematics; Math. Program.
- 2019
For large-scale finite-sum minimization problems, we study non-asymptotic and high-probability global as well as local convergence properties of variants of Newton’s method where the Hessian and/or…
Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients
- Mathematics; ArXiv
- 2018
It is shown that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable; this result holds even if an approximate Hessian is used and if the subproblems are only solved approximately.
Linear Convergence with Condition Number Independent Access of Full Gradients
- Computer Science; NIPS
- 2013
This paper proposes to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function, and presents a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients.
Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization
- Computer Science; NeurIPS
- 2018
This work develops the first accelerated (deterministic and stochastic) quasi-Newton updates, which lead to provably more aggressive approximations of the inverse Hessian and to speed-ups over classical non-accelerated rules in numerical experiments.
Optimal Rates for the Regularized Least-Squares Algorithm
- Mathematics, Computer Science; Found. Comput. Math.
- 2007
A complete minimax analysis of the problem is described, showing that the convergence rates obtained by regularized least-squares estimators are indeed optimal over a suitable class of priors defined by the considered kernel.
Exact and Inexact Subsampled Newton Methods for Optimization
- Computer Science
- 2016
This paper analyzes an inexact Newton method that solves linear systems approximately using the conjugate gradient (CG) method, and that samples the Hessian and not the gradient (the gradient is assumed to be exact).