Corpus ID: 221081592

Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Authors: Pan Zhou and Xiaotong Yuan
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite this remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and can thus still be expensive for very large datasets. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size… 

Figures and Tables from this paper

A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization

A hybrid stochastic-deterministic minibatch proximal gradient algorithm is proposed for strongly convex problems with linear prediction structure, e.g., least squares and logistic/softmax regression.

Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond

It is proved that Lookahead with SGD as its inner-loop optimizer can better balance optimization error and generalization error, achieving smaller excess risk than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, consistent with what has been observed and proved for neural networks.

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

This work analyzes ADAM-like adaptive gradient algorithms via their Lévy-driven stochastic differential equations (SDEs), studies their local convergence behaviors, and establishes the escaping time of these SDEs from a local basin to explain the better generalization performance of SGD over ADAM.

Theory-Inspired Path-Regularized Differential Network Architecture Search

It is proved that architectures with more skip connections converge faster than the other candidates and are thus selected by DARTS; this, for the first time, theoretically and explicitly reveals the impact of skip connections on fast network optimization and their competitive advantage over other types of operations in DARTS.

Task similarity aware meta learning: theory-inspired improvement on MAML

This supplementary document contains the technical proofs of the results and some additional experimental results of the UAI’21 paper entitled “Task Similarity Aware Meta Learning: Theory-Inspired Improvement on MAML”.

Tensor principal component analysis



High probability generalization bounds for uniformly stable algorithms with nearly optimal rate

A nearly tight bound is proved: for algorithms that are uniformly stable with $\gamma = O(1/\sqrt{n})$, the estimation error is essentially the same as the sampling error, leading to the first high-probability generalization bounds for multi-pass stochastic gradient descent and regularized ERM.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; for non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
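As a rough illustration of the algorithm discussed above (a minimal sketch, not the paper's exact construction; the function name and parameters are hypothetical), SGD with the classical step size 1/(μt) on a μ-strongly convex objective can be written as:

```python
import numpy as np

def sgd_strongly_convex(grad_est, x0, mu, T, rng):
    """SGD with step size 1/(mu*t) for a mu-strongly-convex objective.

    grad_est(x, rng) returns an unbiased stochastic gradient at x.
    Returns both the last iterate and the uniform iterate average;
    for smooth problems the last iterate already attains the O(1/T)
    rate, while averaging is the classical remedy for non-smooth ones.
    """
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for t in range(1, T + 1):
        x = x - (1.0 / (mu * t)) * grad_est(x, rng)
        avg += (x - avg) / t  # running uniform average of iterates
    return x, avg
```

For example, on the noisy quadratic f(x) = ½‖x − c‖² (so μ = 1), both outputs converge to c, with the averaged iterate smoothing out the gradient noise.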

Efficient Stochastic Gradient Hard Thresholding

It is proved that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality, while its hard thresholding complexity scales logarithmically.

New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

This paper affirmatively shows that under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining comparable sample-size-independent incremental first-order oracle complexity to stochastic gradient descent.

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which… 

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite… 

Asynchronous Accelerated Proximal Stochastic Gradient for Strongly Convex Distributed Finite Sums

The decentralized and asynchronous algorithm ADFS is proposed to tackle the case when local functions are themselves finite sums with $m$ components, and can be formulated for non-smooth objectives with equally good scaling properties.

Stochastic Convex Optimization

Stochastic convex optimization is studied, and it is shown that the key ingredients are strong convexity and regularization, which are only a sufficient, but not necessary, condition for meaningful non-trivial learnability.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
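The predictive variance reduction idea behind SVRG, as summarized above, can be sketched for least squares (a minimal illustration under assumed parameters; the function name, step size, and epoch count are hypothetical, not the paper's settings):

```python
import numpy as np

def svrg_least_squares(A, b, step=0.01, n_epochs=30, seed=0):
    """Minimize (1/2n)||Ax - b||^2 with SVRG-style variance reduction.

    Each epoch computes one full gradient at a snapshot point, then
    runs n cheap stochastic steps whose noise shrinks as the iterates
    approach the snapshot, enabling a constant step size.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_ref = np.zeros(d)  # snapshot (reference) point
    for _ in range(n_epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = A.T @ (A @ x_ref - b) / n
        x = x_ref.copy()
        for _ in range(n):
            i = rng.integers(n)
            # Per-sample gradients at the current and snapshot points.
            g_x = A[i] * (A[i] @ x - b[i])
            g_ref = A[i] * (A[i] @ x_ref - b[i])
            # Variance-reduced update: unbiased, with vanishing variance.
            x -= step * (g_x - g_ref + full_grad)
        x_ref = x  # new snapshot for the next epoch
    return x_ref
```

Because the correction term g_ref − full_grad has zero mean and shrinks near the optimum, SVRG retains a constant step size where plain SGD would need a decaying one.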

Understanding Generalization and Optimization Performance of Deep CNNs

This work proves a one-to-one correspondence, with convergence guarantees, between the non-degenerate stationary points of the empirical and population risks, and shows that for an arbitrary gradient descent algorithm, an approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk.