Corpus ID: 237491843

Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information

@article{Jahani2022DoublyAS,
  title={Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information},
  author={Majid Jahani and Sergey G. Rusakov and Zheng Shi and Peter Richt{\'a}rik and Michael W. Mahoney and Martin Tak{\'a}{\v{c}}},
  journal={ArXiv},
  year={2022},
  volume={abs/2109.05198}
}
We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step-size. The search direction contains gradient information preconditioned by a well-scaled diagonal preconditioning matrix that captures the local curvature information. Our methodology does not require the tedious task of learning rate tuning, as the learning rate is… 
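
As a rough illustration of the kind of update the abstract describes, the NumPy sketch below preconditions the gradient with a Hutchinson-style estimate of the Hessian diagonal and sets the step-size from a local smoothness estimate rather than a hand-tuned learning rate. The toy quadratic, the safeguarding constant, and the particular step-size rule are illustrative assumptions, not the authors' exact algorithm.

```python
# Minimal sketch: diagonally preconditioned gradient step with an adaptive
# step-size (illustrative, not the paper's exact method).
import numpy as np

rng = np.random.default_rng(0)

# Toy badly scaled quadratic: f(x) = 0.5 * x^T A x - b^T x
A = np.diag(np.array([100.0, 10.0, 1.0, 0.1]))
b = rng.standard_normal(4)
grad = lambda x: A @ x - b
hvp = lambda x, v: A @ v          # Hessian-vector product (Hessian is constant here)

def hutchinson_diag(x, n_probes=10):
    """Estimate diag(Hessian) as the average of z * (H z) over Rademacher probes z."""
    est = np.zeros_like(x)
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=x.shape)
        est += z * hvp(x, z)
    return est / n_probes

x = rng.standard_normal(4)
x_prev = g_prev = None
for t in range(50):
    g = grad(x)
    D = np.maximum(np.abs(hutchinson_diag(x)), 1e-8)   # safeguarded diagonal preconditioner
    if x_prev is None:
        eta = 1e-3                                     # conservative first step
    else:
        dx, dg = x - x_prev, g - g_prev
        # Step-size from a local smoothness estimate in the metric induced by D
        eta = np.sqrt(np.sum(D * dx ** 2)) / (2.0 * np.sqrt(np.sum(dg ** 2 / D)) + 1e-12)
    x_prev, g_prev = x, g
    x = x - eta * (g / D)                              # preconditioned gradient step
print("final gradient norm:", np.linalg.norm(grad(x)))
```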

Stochastic Gradient Methods with Preconditioned Updates

Because the adaptively scaled methods use approximate partial second-order curvature information, they are better able to mitigate the impact of badly scaled problems; this improved practical performance is demonstrated in the numerical experiments presented in this work.

On Scaled Methods for Saddle Point Problems

A theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RMSProp scaling, and the newer AdaHessian and OASIS, which are based on Hutchinson's approximation.

References


Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study

Detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems demonstrate that these methods are not only computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but are also highly robust to hyper-parameter settings.

SONIA: A Symmetric Blockwise Truncated Optimization Algorithm

Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases, and a stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees.

A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization

Numerical experiments illustrate that the method and a limited memory variant of it are stable and outperform (mini-batch) stochastic gradient and other quasi-Newton methods when employed to solve a few machine learning problems.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
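
A minimal sketch of the diagonal AdaGrad-style update this abstract refers to: per-coordinate step-sizes shrink with the accumulated squared gradients, which is what makes a single global learning rate far less sensitive to tuning. The toy quadratic and the learning-rate value are illustrative assumptions.

```python
import numpy as np

def adagrad_step(x, g, accum, lr=1.0, eps=1e-8):
    """One diagonal AdaGrad step; `accum` holds the running sum of g**2."""
    accum += g ** 2
    return x - lr * g / (np.sqrt(accum) + eps), accum

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2 (its gradient is x)
x, accum = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    x, accum = adagrad_step(x, x, accum)
print(x)   # moves toward the minimizer at the origin
```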

On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.
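
The matrix-free variant can be illustrated with a short sketch: the Hessian is touched only through subsampled Hessian-vector products, and the Newton system is solved inexactly with a few conjugate-gradient iterations. The least-squares toy problem, batch size, and iteration counts below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 1000, 20, 100
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_grad(w):
    return X.T @ (X @ w - y) / n

def subsampled_hvp(w, v, idx):
    """Hessian-vector product using only the rows in `idx` (here the Hessian is X_S^T X_S / |S|)."""
    Xs = X[idx]
    return Xs.T @ (Xs @ v) / len(idx)

def cg(hvp, b, iters=10, tol=1e-8):
    """Plain conjugate gradient for H p = b, accessing H only through `hvp`."""
    p, r = np.zeros_like(b), b.copy()
    q, rs = r.copy(), b @ b
    for _ in range(iters):
        Hq = hvp(q)
        alpha = rs / (q @ Hq)
        p += alpha * q
        r -= alpha * Hq
        rs_new = r @ r
        if rs_new < tol:
            break
        q = r + (rs_new / rs) * q
        rs = rs_new
    return p

w = np.zeros(d)
for _ in range(10):
    idx = rng.choice(n, size=batch, replace=False)
    step = cg(lambda v: subsampled_hvp(w, v, idx), -full_grad(w))   # inexact Newton step
    w += step
print("gradient norm:", np.linalg.norm(full_grad(w)))
```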

A Multi-Batch L-BFGS Method for Machine Learning

This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

The proposed DANCE method is multistage: the solution of each stage serves as a warm start for the next stage, which contains more samples, reducing the number of passes over the data needed to achieve the statistical accuracy of the full training set.
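
A minimal sketch of the accumulating-sample idea (not the authors' DANCE implementation): each stage runs a cheap solver on a growing subset of the data, warm-started from the previous stage's solution. The least-squares objective, stage schedule, and step-size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

w = np.zeros(d)                       # warm start for the first stage
m = 128                               # initial sample size
while m <= n:
    Xs, ys = X[:m], y[:m]             # the stage sees an accumulated subset of the data
    for _ in range(100):              # a few cheap gradient steps, warm-started at w
        w -= 0.2 * Xs.T @ (Xs @ w - ys) / m
    m *= 2                            # the next stage accumulates more samples
print("relative residual on the full data:", np.linalg.norm(X @ w - y) / np.linalg.norm(y))
```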

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

ADAHESSIAN is a new stochastic optimization algorithm that directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead.
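
The Hutchinson estimator mentioned here is easy to sketch: with Rademacher probes z, the expectation of z * (H z) equals the diagonal of H, so the Hessian diagonal can be approximated from Hessian-vector products alone. The random matrix and probe count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 6))
H = H @ H.T                                  # a symmetric "Hessian"

est, n_probes = np.zeros(6), 2000
for _ in range(n_probes):
    z = rng.choice([-1.0, 1.0], size=6)
    est += z * (H @ z)                       # only a Hessian-vector product is needed
est /= n_probes

print(np.round(np.diag(H), 3))               # true diagonal
print(np.round(est, 3))                      # Hutchinson estimate (close, up to sampling noise)
```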

Online Learning Rate Adaptation with Hypergradient Descent

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a…
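
A minimal sketch of hypergradient-style learning-rate adaptation in the spirit of this paper: the learning rate is itself updated by a gradient step, using the inner product of consecutive gradients as the hypergradient signal. The quadratic objective and the hyper-learning-rate value are illustrative assumptions.

```python
import numpy as np

grad = lambda x: np.array([10.0, 1.0]) * x      # gradient of a mildly ill-conditioned quadratic

x = np.array([1.0, 1.0])
alpha, beta = 1e-3, 1e-4                        # initial learning rate, hyper-learning-rate
g_prev = grad(x)
for _ in range(500):
    g = grad(x)
    alpha += beta * (g @ g_prev)                # hypergradient step on the learning rate itself
    x = x - alpha * g
    g_prev = g
print("adapted learning rate:", alpha)
print("gradient norm:", np.linalg.norm(grad(x)))
```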

Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence

It is proved that the proposed stochastic Polyak step-size (SPS) enables SGD to converge to the true solution at a fast rate without requiring knowledge of any problem-dependent constants or incurring additional computational overhead.
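
A minimal sketch of a Polyak-type step-size of the kind described here, on an interpolating least-squares toy problem where each sampled loss f_i can reach zero (so f_i^* = 0). The constant c, the problem, and the iteration count are illustrative assumptions, not the paper's exact SPS setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true                        # noiseless, so every f_i can reach 0

w, c = np.zeros(d), 0.5
for t in range(2000):
    i = rng.integers(n)
    r = X[i] @ w - y[i]
    f_i, g_i = 0.5 * r ** 2, r * X[i]
    gamma = f_i / (c * (g_i @ g_i) + 1e-12)   # Polyak-type step-size with f_i^* = 0
    w -= gamma * g_i
print("distance to w_true:", np.linalg.norm(w - w_true))
```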