# Second-Order Stochastic Optimization for Machine Learning in Linear Time

@article{Agarwal2017SecondOrderSO, title={Second-Order Stochastic Optimization for Machine Learning in Linear Time}, author={Naman Agarwal and Brian Bullins and Elad Hazan}, journal={J. Mach. Learn. Res.}, year={2017}, volume={18}, pages={116:1-116:40} }

First-order stochastic methods are the state of the art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine learning that match the per-iteration cost of gradient-based methods, and in certain settings improve…

## 127 Citations

### Stochastic second-order optimization for over-parameterized machine learning models

- Computer Science
- 2020

We consider stochastic second-order methods for minimizing smooth and strongly convex functions under an interpolation condition, which can be satisfied by over-parameterized machine learning models.…

### Stochastic sub-sampled Newton method with variance reduction

- Computer Science, Mathematics
- Int. J. Wavelets Multiresolution Inf. Process.
- 2019

Stochastic optimization for large-scale machine learning problems has developed dramatically since stochastic gradient methods with variance reduction techniques were introduced. Several…

### Progressive Batching for Efficient Non-linear Least Squares

- Computer Science
- ACCV
- 2020

This work presents an approach for non-linear least squares that guarantees convergence while significantly reducing the required amount of computation, and shows that the proposed method achieves competitive convergence rates compared to traditional second-order approaches on common computer vision problems.

### Oracle Complexity of Second-Order Methods for Finite-Sum Problems

- Computer Science
- ICML
- 2017

Evidence is provided, at least in terms of worst-case guarantees, on whether second-order information can indeed be used to solve finite-sum optimization problems more efficiently.

### Approximate Newton Methods and Their Local Convergence

- Computer Science
- ICML
- 2017

This paper proposes a unifying framework to analyze local convergence properties of second-order methods and presents a theoretical analysis that matches the performance in real applications.

### Interpolation, growth conditions, and stochastic gradient descent

- Computer Science
- 2020

The notion of interpolation is extended to stochastic optimization problems with general, first-order oracles, and a simple extension to ℓ2-regularized minimization is provided, which opens the path to proximal-gradient methods and non-smooth optimization under interpolation.

### Finding Local Minima for Nonconvex Optimization in Linear Time

- Computer Science
- 2016

A non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which is linear in the input representation and applies to a very general class of optimization problems, including training a neural network and many other non-convex objectives arising in machine learning.

### SPAN: A Stochastic Projected Approximate Newton Method

- Computer Science
- AAAI
- 2020

This paper proposes SPAN, a novel approximate and fast Newton method that computes the inverse of the Hessian matrix via low-rank approximation and stochastic Hessian-vector products and achieves a better trade-off between the convergence rate and the per-iteration efficiency.

### Approximate Newton Methods

- Computer Science
- J. Mach. Learn. Res.
- 2021

This paper proposes a unifying framework to analyze both local and global convergence properties of second-order methods and presents theoretical results that match the performance in real applications well.

### Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization

- Computer Science
- NeurIPS
- 2018

This work develops the first accelerated (deterministic and stochastic) quasi-Newton updates, which lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments.

## References

### Oracle Complexity of Second-Order Methods for Finite-Sum Problems

- Computer Science
- ICML
- 2017

Evidence is provided, at least in terms of worst-case guarantees, on whether second-order information can indeed be used to solve finite-sum optimization problems more efficiently.

### On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

- Computer Science
- SIAM J. Optim.
- 2011

Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.

### A Stochastic Quasi-Newton Method for Online Convex Optimization

- Computer Science
- AISTATS
- 2007

Stochastic variants of the well-known BFGS quasi-Newton optimization method, in both full and memory-limited (LBFGS) forms, are developed for online optimization of convex functions, and asymptotically outperform previous stochastic gradient methods for parameter estimation in conditional random fields.

### A Stochastic Quasi-Newton Method for Large-Scale Optimization

- Computer Science
- SIAM J. Optim.
- 2016

A stochastic quasi-Newton method that is efficient, robust and scalable, and employs the classical BFGS update formula in its limited memory form, based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products.

### A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

- Computer Science
- NIPS
- 2012

A new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex, which incorporates a memory of previous gradient values in order to achieve a linear convergence rate.

### A Universal Catalyst for First-Order Optimization

- Computer Science
- NIPS
- 2015

This work introduces a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm, and shows that acceleration is useful in practice, especially for ill-conditioned problems where the authors measure significant improvements.

### A Linearly-Convergent Stochastic L-BFGS Algorithm

- Computer Science
- AISTATS
- 2016

It is demonstrated experimentally that the proposed new stochastic L-BFGS algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision.

### Linear Convergence with Condition Number Independent Access of Full Gradients

- Computer Science
- NIPS
- 2013

This paper proposes to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function, and presents a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients.

### A Unifying Framework for Convergence Analysis of Approximate Newton Methods

- Computer Science
- ArXiv
- 2017

A unifying framework for analyzing local convergence properties of second-order methods is proposed; based on this framework, the theoretical analysis matches the performance in real applications.

### Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science
- J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.