Corpus ID: 53102513

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

@article{Jahani2020EfficientDH,
  title={Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy},
  author={Majid Jahani and Xi He and Chenxin Ma and Aryan Mokhtari and D. Mudigere and Alejandro Ribeiro and Martin Tak{\'a}c},
  journal={ArXiv},
  year={2020},
  volume={abs/1810.11507}
}
In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size is gradually increased to quickly obtain a solution whose empirical loss is within a satisfactory statistical accuracy. Our proposed method is multistage: the solution of each stage serves as a warm start for the next stage, which contains more samples (including all samples from the previous stage). The proposed multistage algorithm reduces the number of passes over data to achieve…
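As a rough illustration of the accumulating-sample strategy described in the abstract, the sketch below grows a nested subsample geometrically and warm-starts each stage from the previous stage's solution. The names (`solve_stage`, `n0`, `growth`) are hypothetical placeholders, and the inner solver is left abstract rather than being the paper's distributed Newton-CG routine.

```python
import numpy as np

def accumulating_sample_erm(X, y, solve_stage, n0=1000, growth=2.0):
    """Sketch of an accumulating-sample strategy: solve a sequence of ERM
    problems on nested subsets, warm-starting each stage from the last.

    `solve_stage(X_sub, y_sub, w_init)` is a placeholder for any solver
    (e.g. a distributed Newton-CG step) run to the statistical accuracy
    of the current subset; it is NOT the paper's exact DANCE routine.
    """
    n = X.shape[0]
    m = min(n0, n)                    # initial sample size
    w = np.zeros(X.shape[1])          # initial iterate
    while True:
        # Stage: minimize the empirical loss on the first m samples,
        # warm-started at the previous stage's solution.
        w = solve_stage(X[:m], y[:m], w)
        if m == n:                    # full dataset reached
            return w
        m = min(int(growth * m), n)   # enlarge the sample, keeping old samples
```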
Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information
We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically…
Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample
Numerical tests on a toy classification problem as well as on popular benchmarking binary classification and neural network training tasks reveal that the proposed sampled quasi-Newton methods outperform their classical variants.
SONIA: A Symmetric Blockwise Truncated Optimization Algorithm
Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases, and a stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees.
Efficient Nonconvex Empirical Risk Minimization via Adaptive Sample Size Methods
This paper proposes an adaptive sample size scheme to reduce the overall computational complexity of finding a local minimum of an empirical risk minimization (ERM) problem where the loss associated with each sample is possibly a nonconvex function.
Grow Your Samples and Optimize Better via Distributed Newton CG and Accumulating Strategy
In this work, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size is gradually increased to quickly obtain a solution whose empirical loss is under…
Quasi-Newton Methods for Deep Learning: Forget the Past, Just Sample
Numerical tests on a toy classification problem as well as on popular benchmarking neural network training tasks reveal that the sampled quasi-Newton methods outperform their classical variants.
Distributed Learning with Compressed Gradient Differences
This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences; a theoretical analysis in the strongly convex and nonconvex settings shows that its rates are superior to existing rates.
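The compression-of-gradient-differences idea summarized above can be sketched roughly as follows. This is a simplified single-round illustration with a toy sparsifying compressor; it is not the authors' exact DIANA algorithm, and all names here are hypothetical.

```python
import numpy as np

def random_sparsify(v, k, rng):
    """Toy unbiased compressor: keep k random coordinates, rescale by d/k."""
    d = v.size
    k = min(k, d)
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

def compressed_difference_round(worker_grads, h, alpha=0.1, k=10, rng=None):
    """One round in the spirit of compressed gradient differences: each worker
    compresses g_i - h_i instead of g_i, then updates its memory vector h_i.
    Returns the server-side gradient estimate and the updated memories.
    (Simplified illustration, not the authors' exact DIANA algorithm.)
    """
    rng = rng or np.random.default_rng(0)
    deltas = [random_sparsify(g - hi, k, rng) for g, hi in zip(worker_grads, h)]
    h_new = [hi + alpha * d for hi, d in zip(h, deltas)]
    g_hat = np.mean(h, axis=0) + np.mean(deltas, axis=0)  # aggregated estimate
    return g_hat, h_new
```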
Sampled Quasi-Newton Methods for Deep Learning
We present two sampled quasi-Newton methods: sampled LBFGS and sampled LSR1. Contrary to the classical variants that sequentially build Hessian approximations, our proposed methods sample points…
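A minimal sketch of the sampling idea mentioned in this abstract: curvature pairs are built from points drawn around the current iterate instead of being accumulated across iterations. Gradient differences are used here for simplicity (Hessian-vector products are another common choice), and the function names are assumptions.

```python
import numpy as np

def sampled_curvature_pairs(w, grad, m=10, radius=0.1, rng=None):
    """Build (S, Y) curvature pairs by sampling points around the current
    iterate w, rather than collecting them sequentially across iterations.
    `grad` is any callable returning the gradient at a point. Illustrative only.
    """
    rng = rng or np.random.default_rng(0)
    g0 = grad(w)
    S, Y = [], []
    for _ in range(m):
        s = radius * rng.standard_normal(w.shape)   # random displacement
        y = grad(w + s) - g0                        # curvature information along s
        if s @ y > 1e-10:                           # keep only positive-curvature pairs
            S.append(s)
            Y.append(y)
    return np.array(S), np.array(Y)
```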
Scaling Up Quasi-newton Algorithms: Communication Efficient Distributed SR1
DS-LSR1 is proposed, a communication-efficient variant of the S-LSR1 method that drastically reduces the amount of data communicated at every iteration, has favorable workload balancing across nodes, and is matrix-free and inverse-free.
LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
A unified analysis of gradient coding, worker grouping, and adaptive worker selection techniques in terms of wall-clock time, communication, and computation complexity measures shows that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost.

References

Showing 1-10 of 43 references
Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method
This paper proposes a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition.
DiSCO: Distributed Optimization for Self-Concordant Empirical Loss
The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method, and its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions are analyzed.
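The inexact Newton-CG building block shared by DiSCO and Hessian-free methods such as DANCE can be sketched as below: the Newton system is solved approximately with conjugate gradient using only Hessian-vector products. This is a plain, unpreconditioned and undamped version, so it illustrates the idea rather than either paper's exact method.

```python
import numpy as np

def newton_cg_direction(hvp, grad, tol=1e-8, max_iter=100):
    """Hessian-free inexact Newton step: approximately solve H d = -grad with
    conjugate gradient, using only Hessian-vector products hvp(v) = H @ v.
    Assumes H is positive definite; `hvp` and `grad` are caller-supplied.
    """
    d = np.zeros_like(grad)
    r = -grad.copy()                 # residual of H d = -grad at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)    # exact line search along p
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:    # residual small enough: stop early
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d
```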
A Stochastic Quasi-Newton Method for Large-Scale Optimization
A stochastic quasi-Newton method that is efficient, robust and scalable, and employs the classical BFGS update formula in its limited memory form, based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products.
First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization
Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods.
A Multi-Batch L-BFGS Method for Machine Learning
This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.
Quasi-Newton Methods for Deep Learning: Forget the Past, Just Sample
Numerical tests on a toy classification problem as well as on popular benchmarking neural network training tasks reveal that the sampled quasi-Newton methods outperform their classical variants.
Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy
It is shown theoretically and empirically that Ada Newton can double the size of the training set in each iteration to achieve the statistical accuracy of the full training set with about two passes over the dataset.
Sub-sampled Newton methods
For large-scale finite-sum minimization problems, we study non-asymptotic and high-probability global as well as local convergence properties of variants of Newton’s method where the Hessian and/or…
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
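The predictive variance reduction referenced here (SVRG) roughly takes the following form; this is a hedged sketch with assumed function names, not the authors' exact implementation.

```python
import numpy as np

def svrg(grad_i, n, w0, lr=0.1, epochs=10, inner=None, rng=None):
    """Sketch of predictive variance reduction (SVRG): each inner step uses a
    per-sample gradient corrected by the difference between the snapshot's
    per-sample gradient and the snapshot's full gradient.
    `grad_i(w, i)` returns the gradient of the i-th sample's loss at w.
    """
    rng = rng or np.random.default_rng(0)
    inner = inner or n
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(inner):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad  # variance-reduced gradient
            w -= lr * v
    return w
```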
A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems
A new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically.
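A short sketch of the FISTA iteration mentioned above: a proximal-gradient (ISTA) step taken at an extrapolated point, combined with the classical momentum schedule. The step size and function handles are assumptions.

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, n_iter=100):
    """Sketch of FISTA for min_x f(x) + g(x).
    `grad_f` is the gradient of the smooth part, `prox_g` the proximal
    operator of the nonsmooth part, and `step` should be at most 1/L
    for L the Lipschitz constant of grad_f.
    """
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox_g(y - step * grad_f(y))                 # ISTA / proximal-gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)      # momentum / extrapolation
        x_prev, t = x, t_next
    return x_prev
```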