• Corpus ID: 221081592

# Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

@article{Zhou2020HybridSM,
title={Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization},
author={Pan Zhou and Xiaotong Yuan},
journal={ArXiv},
year={2020},
volume={abs/2009.09835}
}
• Published 12 July 2020 • ArXiv
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite the remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and thus could still be expensive for huge data. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size…
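To make the "minibatch proximal gradient" terminology concrete, here is a minimal generic sketch of a minibatch proximal gradient step on a lasso-style objective. This is an illustrative baseline only, not the paper's HSDMPG algorithm; the function name, step size, and problem setup are all assumptions for the example.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def minibatch_prox_grad(A, b, lam, step=0.01, batch=32, iters=500, seed=0):
    """Generic minibatch proximal gradient sketch (NOT HSDMPG) for
    min_w  0.5/n * ||A w - b||^2 + lam * ||w||_1.
    Each iteration uses a minibatch gradient followed by a proximal step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch, replace=False)
        g = A[idx].T @ (A[idx] @ w - b[idx]) / batch  # minibatch gradient estimate
        w = prox_l1(w - step * g, step * lam)          # proximal (shrinkage) step
    return w
```

The variance of the minibatch gradient is what SVRG-type methods, and the hybrid scheme proposed here, aim to control with fewer stochastic gradient evaluations.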
## 6 Citations

• IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. A hybrid stochastic-deterministic minibatch proximal gradient algorithm is proposed for strongly convex problems with linear prediction structure, e.g., least squares and logistic/softmax regression.
• NeurIPS, 2021. It is proved that Lookahead with SGD as its inner-loop optimizer can better balance optimization error and generalization error, achieving smaller excess risk than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed/proved in neural networks.
• NeurIPS, 2020. Adam-like adaptive gradient algorithms are analyzed through their Lévy-driven stochastic differential equations (SDEs); the escaping time of these SDEs from a local basin is established to explain the better generalization performance of SGD over Adam.
• NeurIPS, 2020. It is proved that architectures with more skip connections converge faster than the other candidates and are thus selected by DARTS, which for the first time theoretically and explicitly reveals the impact of skip connections on fast network optimization and their competitive advantage over other types of operations in DARTS.
• UAI, 2021. Supplementary document containing the technical proofs and additional experimental results of the UAI'21 paper entitled "Task Similarity Aware Meta Learning: Theory-inspired…"

## References

Showing 1-10 of 40 references

• COLT, 2019. A nearly tight bound is proved showing that for algorithms that are uniformly stable with $\gamma = O(1/\sqrt{n})$, the estimation error is essentially the same as the sampling error, leading to the first high-probability generalization bounds for multi-pass stochastic gradient descent and regularized ERM.
• ICML, 2012. This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; for non-smooth problems, however, the convergence rate with averaging can really be Ω(log(T)/T), and this is not just an artifact of the analysis.
• NeurIPS, 2018. It is proved that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality, while its hard thresholding complexity scales logarithmically.
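The hard thresholding operation referenced in this entry, keeping only the k largest-magnitude coordinates of a vector, can be sketched generically as follows (this is the standard operator, not the HSG-HT algorithm itself):

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest.
    Generic sketch of the hard thresholding operator used in
    sparsity-constrained optimization."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out
```

Unlike the soft-thresholding proximal operator of the L1 norm, hard thresholding enforces an exact sparsity level k, which is why its complexity is counted separately from gradient evaluations.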
• NeurIPS, 2018. This paper affirmatively shows that under without-replacement sampling (WoRS), for both convex and non-convex problems, HSGD with a constant step size can still match full gradient descent in rate of convergence, while maintaining an incremental first-order oracle complexity that is sample-size-independent and comparable to stochastic gradient descent.
• NIPS, 2013. We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
• J. Mach. Learn. Res., 2017. We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…
• ArXiv, 2019. The decentralized and asynchronous algorithm ADFS is proposed to tackle the case when local functions are themselves finite sums with $m$ components; it can be formulated for non-smooth objectives with equally good scaling properties.
• COLT, 2009. Stochastic convex optimization is studied, and it is shown that the key ingredient is strong convexity and regularization, which is only a sufficient, but not necessary, condition for meaningful non-trivial learnability.
• NIPS, 2013. It is proved that this method enjoys the same fast convergence rate as stochastic dual coordinate ascent (SDCA) and stochastic average gradient (SAG), but the analysis is significantly simpler and more intuitive.
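The variance-reduction idea summarized in this entry (the SVRG scheme) can be sketched in a few lines. This is a generic illustration of the standard update $w \leftarrow w - \eta\,(g_i(w) - g_i(\tilde{w}) + \mu)$ under assumed step size and loop lengths, not code from any of the papers listed here:

```python
import numpy as np

def svrg(grad_i, w0, n, step, epochs, inner, seed=0):
    """Generic SVRG sketch. grad_i(w, i) returns the gradient of the
    i-th component function at w. Each epoch takes a snapshot, computes
    the full gradient there, and runs variance-reduced inner updates."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        # full gradient at the snapshot (the "deterministic" anchor)
        mu = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(inner):
            i = rng.integers(n)
            # stochastic gradient corrected by the snapshot difference
            w = w - step * (grad_i(w, i) - grad_i(w_snap, i) + mu)
    return w
```

The periodic full-gradient pass is exactly what makes SVRG's stochastic gradient complexity scale linearly with data size, the deficiency that the surveyed paper's hybrid scheme targets.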
• ICML, 2018. This work proves a one-to-one correspondence, with convergence guarantees, between the non-degenerate stationary points of the empirical and population risks, and proves that for an arbitrary gradient descent algorithm, an approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk.