Quasi-Newton methods for machine learning: forget the past, just sample

@article{Berahas2021QuasiNewtonMF,
  title={Quasi-Newton methods for machine learning: forget the past, just sample},
  author={Albert S. Berahas and Majid Jahani and Peter Richt{\'a}rik and Martin Tak{\'a}{\v{c}}},
  journal={Optimization Methods and Software},
  year={2021}
}
We present two sampled quasi-Newton methods (sampled LBFGS and sampled LSR1) for solving empirical risk minimization problems that arise in machine learning. Contrary to the classical variants of these methods that sequentially build Hessian or inverse Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to produce these approximations. As a result, the approximations constructed make use of more… 
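The sampling idea lends itself to a short illustration. The sketch below shows one way freshly sampled curvature pairs could be formed around the current iterate and fed into a standard L-BFGS two-loop recursion; it is a minimal sketch, assuming gradient differences at randomly drawn points, and the function `grad`, memory size `m`, and sampling radius `r` are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def sampled_lbfgs_direction(w, grad, m=10, r=0.1, rng=None):
    """Return an approximate -H*g search direction at w, where H is built from
    m freshly sampled curvature pairs (no history carried across iterations).

    Illustrative sketch: pairs come from gradient differences at random points
    around w; Hessian-vector variants are also possible.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = grad(w)
    S, Y = [], []
    for _ in range(m):
        s = r * rng.standard_normal(w.shape)   # random displacement around the iterate
        y = grad(w + s) - g                    # curvature pair via a gradient difference
        if s @ y > 1e-10:                      # keep only pairs with positive curvature
            S.append(s)
            Y.append(y)

    # Standard L-BFGS two-loop recursion over the sampled pairs.
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(S), reversed(Y)):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    if S:
        q *= (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])  # initial Hessian scaling
    for (s, y), a in zip(zip(S, Y), reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q                                   # quasi-Newton search direction
```

A driver loop would take a step along the returned direction (e.g. with a line search) and repeat with newly sampled pairs at the next iterate, which is what distinguishes the sampled variants from history-based L-BFGS.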
Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks
TLDR
This paper investigates accelerating the Symmetric Rank-1 (SR1) quasi-Newton method with Nesterov's gradient for training neural networks and briefly discusses its convergence.
A Novel Fast Exact Subproblem Solver for Stochastic Quasi-Newton Cubic Regularized Optimization
TLDR
This work describes an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices, and shows that the new approach, ARCLQN, compares to modern optimizers with minimal tuning.
SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models
TLDR
This paper proposes a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer, using the quasi-Newton matrices from the forward pass to efficiently approximate the inverse Jacobian matrix in the direction needed for the gradient computation.
QN Optimization with Hessian Sample
This article explores how to effectively incorporate curvature information generated using SIMD-parallel forward-mode Algorithmic Differentiation (AD) into unconstrained Quasi-Newton (QN) minimization.
FLECS: A Federated Learning Second-Order Framework via Compression and Sketching
TLDR
A new communication-efficient second-order framework for federated learning, FLECS, is proposed, which reduces the high memory requirements of FedNL by using an L-SR1-type update for the Hessian approximation stored on the central server.

References

Showing 1-10 of 66 references
A robust multi-batch L-BFGS method for machine learning*
TLDR
This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and non-convex functions, and illustrates the behaviour of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.
Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
TLDR
Detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems demonstrate that these methods are not only computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but also highly robust to hyper-parameter settings.
A Multi-Batch L-BFGS Method for Machine Learning
TLDR
This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.
A Stochastic Quasi-Newton Method for Large-Scale Optimization
TLDR
A stochastic quasi-Newton method that is efficient, robust and scalable, and employs the classical BFGS update formula in its limited memory form, based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products.
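As a rough illustration of the "collect curvature at regular intervals" idea, the following sketch interleaves cheap first-order steps with an (s, y) pair formed every L iterations from averaged iterates and a sub-sampled Hessian-vector product. This is a sketch under assumptions: `sgd_step`, `hess_vec`, and `sample_batch` are hypothetical callables, not part of the cited method's code, and the quasi-Newton step built from the pairs is omitted.

```python
import numpy as np

def sqn_curvature_pairs(w0, sgd_step, hess_vec, sample_batch, n_iters, L=20, m=10):
    """Every L iterations, form one (s, y) pair from averaged iterates, with y
    computed as a sub-sampled Hessian-vector product along s (rather than a
    noisy gradient difference).  Returns the limited memory of pairs."""
    w = np.asarray(w0, dtype=float).copy()
    w_sum = np.zeros_like(w)
    w_bar_prev = None
    S, Y = [], []
    for t in range(1, n_iters + 1):
        w = sgd_step(w)                       # cheap stochastic step between updates
        w_sum += w
        if t % L == 0:                        # curvature collected at regular intervals
            w_bar = w_sum / L                 # average of the last L iterates
            if w_bar_prev is not None:
                s = w_bar - w_bar_prev        # displacement of averaged iterates
                y = hess_vec(w_bar, s, sample_batch())  # sub-sampled Hessian times s
                S.append(s)
                Y.append(y)
                S, Y = S[-m:], Y[-m:]         # limited memory: keep the newest m pairs
            w_bar_prev = w_bar
            w_sum = np.zeros_like(w)
    return S, Y
```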
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
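For concreteness, the widely used diagonal special case of this adaptive scheme reduces to a per-coordinate step size that shrinks with the accumulated squared gradients. The sketch below is illustrative only and omits the composite/proximal terms treated in the paper; all names are placeholders.

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.01, eps=1e-8):
    """One diagonal-AdaGrad update: each coordinate's effective step size is
    lr / sqrt(accumulated squared gradients for that coordinate)."""
    accum = accum + g * g                      # per-coordinate squared-gradient history
    w = w - lr * g / (np.sqrt(accum) + eps)    # rarely updated coordinates keep larger steps
    return w, accum
```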
Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy
TLDR
The proposed DANCE method is multistage: the solution of each stage serves as a warm start for the next stage, which uses more samples, reducing the number of passes over the data needed to reach the statistical accuracy of the full training set.
On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning
TLDR
Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs
TLDR
adaQN is presented, a stochastic quasi-Newton algorithm for training RNNs that retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme and is judicious in storing and retaining L-BFGS curvature pairs.
Train faster, generalize better: Stability of stochastic gradient descent
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
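The update itself is compact: exponential moving averages of the gradient and of its elementwise square are bias-corrected and combined into the step. The sketch below restates this standard Adam step with the commonly cited default hyperparameters.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t >= 1, given parameters w, gradient g,
    and running first/second moment estimates m and v."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```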