# Optimization Methods for Large-Scale Machine Learning

@article{Bottou2018OptimizationMF,
title={Optimization Methods for Large-Scale Machine Learning},
author={L. Bottou and Frank E. Curtis and J. Nocedal},
journal={ArXiv},
year={2018},
volume={abs/1606.04838}
}
• Published 2018
• Computer Science, Mathematics
• ArXiv
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. [...] This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
1,493 Citations

#### Paper Mentions

Stochastic Optimization for Machine Learning
Numerical optimization has played an important role in the evolution of machine learning, touching almost every aspect of the discipline. Stochastic approximation has evolved and expanded as one of [...]
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
• Computer Science, Mathematics
• ArXiv
• 2017
The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning, and to discuss how these approaches can be applied to the training of deep neural networks.
Optimization methods for structured machine learning problems
This thesis attempts to solve the ℓ1-regularized fixed-point problem with the help of the Alternating Direction Method of Multipliers (ADMM) and argues that the proposed method is well suited to the structure of the aforementioned fixed-point problem.
A Survey on Large-scale Machine Learning
• Computer Science, Mathematics
• ArXiv
• 2020
This systematic survey of existing large-scale machine learning (LML) methods provides a blueprint for future developments in the area. It categorizes the methods in each perspective according to their targeted scenarios and introduces representative methods in line with intrinsic strategies.
Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis
It is shown that the learning rate in SGD with machine learning noise can be chosen to be small, but uniformly positive for all times if the energy landscape resembles that of overparametrized deep learning problems.
Optimization for Deep Learning: An Overview
This paper discusses the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum and then discusses practical solutions including careful initialization, normalization methods and skip connections, and existing theoretical results.
Optimization Models for Machine Learning: A Survey
• Computer Science, Mathematics
• Eur. J. Oper. Res.
• 2021
The machine learning literature is surveyed, and several commonly used machine learning approaches are presented in an optimization framework for regression, classification, clustering, deep learning, and adversarial learning, as well as new emerging applications in machine teaching, empirical model learning, and Bayesian network structure learning.
A Survey of Optimization Methods From a Machine Learning Perspective
• Computer Science, Mathematics
• IEEE Transactions on Cybernetics
• 2020
The optimization problems in machine learning are described, and the principles and progress of commonly used optimization methods are introduced, which can offer guidance for developments in both optimization and machine learning research.
Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
• Computer Science, Mathematics
• SDM
• 2020
Detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems demonstrate that these methods can be computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, and that they are highly robust to hyper-parameter settings.
An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise
• Computer Science, Mathematics
• 2019
Empirical studies with standard deep learning model architectures and datasets show that the proposed method, which adds covariance noise to the gradients, not only improves generalization performance in large-batch training but does so while the optimization performance remains desirable and the training duration is not lengthened.

#### References

SHOWING 1-10 OF 218 REFERENCES
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
• Computer Science, Mathematics
• ArXiv
• 2017
The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning, and to discuss how these approaches can be applied to the training of deep neural networks.
Optimization for Machine Learning
• Computer Science
• 2013
This book captures the state of the art of the interaction between optimization and machine learning in a way that is accessible to researchers in both fields and will enrich the ongoing cross-fertilization between the machine learning community and these other fields, and within the broader optimization community.
Sample size selection in optimization methods for machine learning
• Computer Science, Mathematics
• Math. Program.
• 2012
This paper presents a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and establishes an O(1/ε) complexity bound on the total cost of a gradient method.
Learning Recurrent Neural Networks with Hessian-Free Optimization
• Computer Science
• ICML
• 2011
This work solves the long-outstanding problem of how to effectively train recurrent neural networks on complex and difficult sequence modeling problems which may contain long-term data dependencies and offers a new interpretation of the generalized Gauss-Newton matrix of Schraudolph which is used within the HF approach of Martens.
Neural Networks: Tricks of the Trade
• Computer Science
• Lecture Notes in Computer Science
• 1998
It is shown how nonlinear semi-supervised embedding algorithms popular for use with "shallow" learning techniques such as kernel methods can be easily applied to deep multi-layer architectures.
Adam: A Method for Stochastic Optimization
• Computer Science, Mathematics
• ICLR
• 2015
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
• Computer Science, Mathematics
• J. Mach. Learn. Res.
• 2011
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
On the generalization ability of on-line learning algorithms
• Computer Science, Mathematics
• IEEE Transactions on Information Theory
• 2004
This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic M_n associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.
On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning
• Mathematics, Computer Science
• SIAM J. Optim.
• 2011
Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.
Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent
• W. Xu
• Computer Science
• ArXiv
• 2011
A finite sample analysis for the method of Polyak and Juditsky (1992) shows that ASGD usually needs a huge number of samples to reach its asymptotic region when the learning rate is improperly chosen, and a simple way to properly set the learning rate is proposed.
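
The iterate averaging of Polyak and Juditsky mentioned in the last entry can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `make_data` and `averaged_sgd`, the synthetic one-dimensional regression problem, the constant step size, and averaging from the first step. It is not the learning-rate scheme proposed in the cited paper, only plain SGD with a running mean of the iterates.

```python
import random


def make_data(n=200, seed=1):
    """Synthetic 1-D regression data: y = 2x - 1 plus small Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x - 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def averaged_sgd(data, lr=0.05, epochs=20, seed=0):
    """Run plain SGD on squared error and keep a running average of the iterates.

    Returns ((w, b), (w_avg, b_avg)): the last iterate and the averaged iterate.
    The constant step size `lr` is an assumption of this sketch.
    """
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled in place
    w, b = 0.0, 0.0
    w_avg, b_avg = 0.0, 0.0
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y  # residual of the current linear model
            w -= lr * err * x      # stochastic gradient step for 0.5 * err**2
            b -= lr * err
            t += 1
            # Incremental mean of all iterates seen so far.
            w_avg += (w - w_avg) / t
            b_avg += (b - b_avg) / t
    return (w, b), (w_avg, b_avg)


if __name__ == "__main__":
    last, avg = averaged_sgd(make_data())
    print("last iterate:", last)
    print("averaged iterate:", avg)
```

Because the average starts at the first step, it also includes the early iterates far from the solution. Practical variants often begin averaging only after a burn-in period, which this sketch omits for brevity.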