Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
@article{Zhou2020HybridSM,
  title   = {Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization},
  author  = {Pan Zhou and Xiaotong Yuan},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2009.09835}
}
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite the remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and thus could still be expensive for huge data. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size…
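The abstract names the algorithm but not its update rule, so the following is only a minimal, hypothetical sketch of the general idea suggested by the title: a minibatch proximal gradient step whose gradient estimate blends a periodically refreshed deterministic (full-data) gradient with a cheap stochastic minibatch gradient. It is not the authors' HSDMPG; the mixing weight `alpha`, the refresh schedule, the l1-regularized least-squares objective, and all function names are illustrative assumptions.

```python
# Hypothetical hybrid stochastic-deterministic minibatch proximal gradient sketch.
# NOT the paper's HSDMPG: the mixing rule, schedule, and objective are assumptions
# chosen only to illustrate the style of update the title refers to.
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1, used here as an example regularizer."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def hybrid_minibatch_prox_gradient(A, b, lam=1e-2, eta=0.1, alpha=0.5,
                                   batch_size=32, n_iters=300, refresh_every=25,
                                   seed=0):
    """Minimize (1/2n)||Ax - b||^2 + lam * ||x||_1 with a hybrid gradient estimate."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    full_grad = np.zeros(d)
    for t in range(n_iters):
        if t % refresh_every == 0:
            # Deterministic component: full gradient, refreshed only occasionally.
            full_grad = A.T @ (A @ x - b) / n
        idx = rng.choice(n, size=batch_size, replace=False)
        # Stochastic component: gradient on a small minibatch.
        mb_grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        g = alpha * full_grad + (1.0 - alpha) * mb_grad   # hybrid estimate
        x = soft_threshold(x - eta * g, eta * lam)        # proximal step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 20))
    x_true = np.zeros(20)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(500)
    print(np.round(hybrid_minibatch_prox_gradient(A, b), 2))
```

In this sketch most iterations touch only a minibatch, and the full gradient is recomputed every `refresh_every` steps; the paper's actual schedule, step sizes, and the resulting complexity and generalization bounds are what distinguish HSDMPG and are given in the paper itself.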
6 Citations
A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2022
A hybrid stochastic-deterministic minibatch proximal gradient algorithm is proposed for strongly convex problems with linear prediction structure, e.g., least squares and logistic/softmax regression.
Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
- Computer Science, NeurIPS
- 2021
It is proved that lookahead with SGD as its inner-loop optimizer better balances optimization error and generalization error, achieving smaller excess risk than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed or proved to hold in neural networks.
Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning
- Computer Science, NeurIPS
- 2020
This work analyzes ADAM-like adaptive gradient algorithms through their Lévy-driven stochastic differential equations (SDEs) and local convergence behaviors, and establishes the escaping time of these SDEs from a local basin to explain the better generalization performance of SGD over ADAM.
Theory-Inspired Path-Regularized Differential Network Architecture Search
- Computer Science, NeurIPS
- 2020
It is proved that architectures with more skip connections converge faster than other candidates and are thus selected by DARTS; this, for the first time, theoretically and explicitly reveals the impact of skip connections on fast network optimization and their competitive advantage over other types of operations in DARTS.
Task similarity aware meta learning: theory-inspired improvement on MAML
- Computer Science, UAI
- 2021
This supplementary document contains the technical proofs of the results and some additional experimental results of the UAI’21 paper entitled “Task Similarity Aware Meta Learning: Theory-inspired…
References
Showing 1-10 of 40 references
High probability generalization bounds for uniformly stable algorithms with nearly optimal rate
- Computer Science, Mathematics, COLT
- 2019
A nearly tight bound is proved showing that, for algorithms that are uniformly stable with $\gamma = O(1/\sqrt{n})$, the estimation error is essentially the same as the sampling error; this leads to the first high-probability generalization bounds for multi-pass stochastic gradient descent and regularized ERM.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
- Computer Science, ICML
- 2012
This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; however, for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Efficient Stochastic Gradient Hard Thresholding
- Computer Science, NeurIPS
- 2018
It is proved that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality, and its hard thresholding complexity scales logarithmically.
New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity
- Computer Science, NeurIPS
- 2018
This paper affirmatively shows that, under WoRS and for both convex and non-convex problems, HSGD (with constant step size) can still match full gradient descent in rate of convergence while maintaining a sample-size-independent incremental first-order oracle complexity comparable to that of stochastic gradient descent.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
- Computer Science, Mathematics, NIPS
- 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
- Computer Science, J. Mach. Learn. Res.
- 2017
We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
Asynchronous Accelerated Proximal Stochastic Gradient for Strongly Convex Distributed Finite Sums
- Computer Science, Mathematics, ArXiv
- 2019
The decentralized and asynchronous algorithm ADFS is proposed to tackle the case when local functions are themselves finite sums with $m$ components, and can be formulated for non-smooth objectives with equally good scaling properties.
Stochastic Convex Optimization
- Computer Science, COLT
- 2009
Stochastic convex optimization is studied, and it is shown that the key ingredients are strong convexity and regularization, which are a sufficient, but not necessary, condition for meaningful non-trivial learnability.
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
- Computer Science, NIPS
- 2013
It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
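For concreteness, below is a minimal sketch of the predictive variance-reduction (SVRG-style) gradient estimator this reference introduces, written for plain least squares; the step size, epoch schedule, and function name are illustrative choices, not the paper's exact setup.

```python
# SVRG-style update sketch: a per-example stochastic gradient is corrected by the
# gradient at a snapshot point plus the snapshot's full gradient, which reduces
# the estimator's variance as the iterate approaches the snapshot.
import numpy as np

def svrg_least_squares(A, b, eta=1e-2, n_epochs=10, inner_steps=None, seed=0):
    """Minimize (1/2n)||Ax - b||^2 with an SVRG-style variance-reduced estimator."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    inner_steps = inner_steps or n
    x = np.zeros(d)
    for _ in range(n_epochs):
        x_ref = x.copy()
        mu = A.T @ (A @ x_ref - b) / n             # full gradient at the snapshot
        for _ in range(inner_steps):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - b[i])          # stochastic gradient at x
            gi_ref = A[i] * (A[i] @ x_ref - b[i])  # stochastic gradient at snapshot
            x = x - eta * (gi - gi_ref + mu)       # variance-reduced step
    return x
```

The estimator `gi - gi_ref + mu` is unbiased, and its variance shrinks as `x` approaches `x_ref`, which is what allows a constant step size and the fast convergence rate mentioned above.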
Understanding Generalization and Optimization Performance of Deep CNNs
- Computer Science, ICML
- 2018
This work proves a one-to-one correspondence and convergence guarantees for the non-degenerate stationary points of the empirical and population risks, and proves that for an arbitrary gradient descent algorithm, the approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk.