# Optimization Methods for Large-Scale Machine Learning

```bibtex
@article{Bottou2018OptimizationMF,
  title   = {Optimization Methods for Large-Scale Machine Learning},
  author  = {L. Bottou and Frank E. Curtis and J. Nocedal},
  journal = {ArXiv},
  year    = {2018},
  volume  = {abs/1606.04838}
}
```

This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. [...] This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
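As context for the two research streams above, a minimal sketch of the stochastic gradient (SG) baseline they build on may help. The toy least-squares objective, step-size schedule, and all constants below are illustrative assumptions, not taken from the paper:

```python
import random

def sgd(data, lr0=0.5, epochs=200, seed=0):
    """Plain SGD on the toy objective f(w) = (1/n) * sum_i (w - x_i)^2.

    Each stochastic gradient 2*(w - x_i) is an unbiased but noisy estimate
    of the full gradient 2*(w - mean(data)); the classical remedy is a
    diminishing step size, which noise-reduction and second-order methods
    aim to improve upon.
    """
    rng = random.Random(seed)
    w, t = 0.0, 0
    for _ in range(epochs):
        for _ in range(len(data)):
            x = rng.choice(data)                 # sample one data point
            g = 2.0 * (w - x)                    # stochastic gradient
            t += 1
            w -= (lr0 / (1.0 + 0.05 * t)) * g    # diminishing step size
    return w

w_star = sgd([1.0, 2.0, 3.0, 4.0])   # minimizer is the sample mean, 2.5
```

The noise in the final iterate, which never fully vanishes for a fixed step size, is exactly the issue the surveyed noise-reduction methods target.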

#### 1,493 Citations

Stochastic Optimization for Machine Learning

- 2018

Numerical optimization has played an important role in the evolution of machine learning, touching almost every aspect of the discipline. Stochastic approximation has evolved and expanded as one of…

Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2017

The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning, and to discuss how these approaches can be applied to the training of deep neural networks.

Optimization methods for structured machine learning problems

- Computer Science
- 2019

This thesis attempts to solve the ℓ1-regularized fixed-point problem with the help of the Alternating Direction Method of Multipliers (ADMM) and argues that the proposed method is well suited to the structure of the aforementioned fixed-point problem.

A Survey on Large-scale Machine Learning

- Computer Science, Mathematics
- ArXiv
- 2020

A systematic survey of existing large-scale machine learning (LML) methods is offered to provide a blueprint for future developments in this area; the methods in each perspective are categorized according to their targeted scenarios, and representative methods are introduced in line with their intrinsic strategies.

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown that the learning rate in SGD with machine learning noise can be chosen to be small, but uniformly positive for all times, if the energy landscape resembles that of overparametrized deep learning problems.

Optimization for Deep Learning: An Overview

- Computer Science
- 2020

This paper discusses the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discusses practical solutions including careful initialization, normalization methods, and skip connections, as well as existing theoretical results.

Optimization Models for Machine Learning: A Survey

- Computer Science, Mathematics
- Eur. J. Oper. Res.
- 2021

The machine learning literature is surveyed, and several commonly used machine learning approaches are presented in an optimization framework for regression, classification, clustering, deep learning, and adversarial learning, as well as new emerging applications in machine teaching, empirical model learning, and Bayesian network structure learning.

A Survey of Optimization Methods From a Machine Learning Perspective

- Computer Science, Mathematics
- IEEE Transactions on Cybernetics
- 2020

The optimization problems in machine learning are described and the principles and progress of commonly used optimization methods are introduced, which can offer guidance for the development of both optimization and machine learning research.

Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study

- Computer Science, Mathematics
- SDM
- 2020

Detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust-region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems demonstrate that these methods are not only computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but are also highly robust to hyper-parameter settings.

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise

- Computer Science, Mathematics
- 2019

Empirical studies with standard deep learning architectures and datasets show that the proposed method of adding covariance noise to the gradients not only improves generalization performance in large-batch training, but does so in a way where optimization performance remains desirable and training duration is not prolonged.

#### References

Showing 1-10 of 218 references

Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2017

The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning, and to discuss how these approaches can be applied to the training of deep neural networks.

Optimization for Machine Learning

- Computer Science
- 2013

This book captures the state of the art of the interaction between optimization and machine learning in a way that is accessible to researchers in both fields, and will enrich the ongoing cross-fertilization between the machine learning community, these other fields, and the broader optimization community.

Sample size selection in optimization methods for machine learning

- Computer Science, Mathematics
- Math. Program.
- 2012

A criterion is presented for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and an O(1/ε) complexity bound is established on the total cost of a gradient method.
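The variance-based criterion can be sketched in a simplified scalar form. The test below is an assumption-laden paraphrase (a "norm test" on one-dimensional per-example gradients), not the paper's exact rule:

```python
def should_grow_sample(sample_grads, theta=1.0):
    """Grow the sample S when the estimated variance of the per-example
    gradients dominates the squared norm of the batch gradient:

        Var(g_i) / |S|  >  theta^2 * ||g_S||^2

    Scalar gradients are used here purely for illustration.
    """
    n = len(sample_grads)
    g_s = sum(sample_grads) / n                                # batch gradient
    var = sum((g - g_s) ** 2 for g in sample_grads) / (n - 1)  # sample variance
    return var / n > (theta ** 2) * g_s ** 2

# consistent per-example gradients: keep the sample size
should_grow_sample([1.0, 1.1, 0.9, 1.0])    # -> False
# conflicting gradients around zero: the estimate is unreliable, grow it
should_grow_sample([1.0, -1.0, 0.9, -0.9])  # -> True
```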

Learning Recurrent Neural Networks with Hessian-Free Optimization

- Computer Science
- ICML
- 2011

This work solves the long-outstanding problem of how to effectively train recurrent neural networks on complex and difficult sequence modeling problems which may contain long-term data dependencies, and offers a new interpretation of the generalized Gauss-Newton matrix of Schraudolph which is used within the HF approach of Martens.

Neural Networks: Tricks of the Trade

- Computer Science
- Lecture Notes in Computer Science
- 1998

It is shown how nonlinear semi-supervised embedding algorithms popular for use with "shallow" learning techniques such as kernel methods can be easily applied to deep multi-layer architectures.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
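The update Adam performs can be sketched for a single scalar parameter as follows. The hyper-parameter defaults follow the paper; the toy quadratic at the bottom is an illustrative assumption:

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w, given gradient g and
    running moment estimates m (first moment) and v (second moment)."""
    m = b1 * m + (1 - b1) * g        # biased first-moment estimate
    v = b2 * v + (1 - b2) * g * g    # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)        # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# toy usage: minimize f(w) = w^2 starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
```

Note how the effective step size is roughly `lr` regardless of the raw gradient magnitude, since the first moment is normalized by the square root of the second.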

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
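In its simplest diagonal form, the adaptive step behind this line of work (AdaGrad) can be sketched as below. The scalar parameter and toy quadratic are illustrative assumptions:

```python
import math

def adagrad_step(w, g, G, lr=0.5, eps=1e-8):
    """One AdaGrad-style update for a scalar parameter: accumulate squared
    gradients in G and scale the step by 1/sqrt(G), so coordinates with a
    history of large gradients automatically get smaller learning rates."""
    G = G + g * g
    w = w - lr * g / (math.sqrt(G) + eps)
    return w, G

# toy usage: minimize f(w) = w^2 starting from w = 5.0
w, G = 5.0, 0.0
for _ in range(2000):
    w, G = adagrad_step(w, 2.0 * w, G)
```

The accumulator `G` only grows, which is what makes the per-coordinate step sizes effectively self-tuning but also monotonically shrinking.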

On the generalization ability of on-line learning algorithms

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2004

This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic M_n associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.

On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

- Mathematics, Computer Science
- SIAM J. Optim.
- 2011

Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.

Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent

- Computer Science
- ArXiv
- 2011

A finite sample analysis for the method of Polyak and Juditsky (1992) shows that it can indeed take a huge number of samples for ASGD to reach its asymptotic regime when the learning rate is chosen improperly, and a simple way to properly set the learning rate is proposed.
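Polyak-Ruppert iterate averaging, the mechanism behind ASGD, can be sketched on a toy problem. The objective, constant step size, and all constants below are illustrative assumptions:

```python
import random

def averaged_sgd(data, lr=0.1, steps=2000, seed=0):
    """SGD with Polyak-Ruppert averaging on f(w) = (1/n) * sum_i (w - x_i)^2.

    With a constant step size the last iterate w keeps fluctuating around
    the minimizer, while the running average w_bar of the iterates is
    typically far less noisy.
    """
    rng = random.Random(seed)
    w, w_bar = 0.0, 0.0
    for t in range(1, steps + 1):
        x = rng.choice(data)
        w -= lr * 2.0 * (w - x)         # constant-step SGD update
        w_bar += (w - w_bar) / t        # running average of iterates
    return w, w_bar

w_last, w_bar = averaged_sgd([1.0, 2.0, 3.0, 4.0])  # minimizer is 2.5
```

The cited analysis concerns exactly when this averaged iterate reaches its asymptotically optimal behavior, which depends delicately on how `lr` is set.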