Corpus ID: 52184014

SEGA: Variance Reduction via Gradient Sketching

@article{Hanzely2018SEGAVR,
  title={SEGA: Variance Reduction via Gradient Sketching},
  author={Filip Hanzely and Konstantin Mishchenko and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2018},
  volume={abs/1809.03054}
}
We propose a randomized first-order optimization method, SEGA (SkEtched GrAdient), which progressively, throughout its iterations, builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient obtained from an oracle. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of…
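To make the update concrete, here is a minimal sketch of the coordinate-sketch special case: the oracle reveals one coordinate of the gradient per iteration, the running estimate h is updated by sketch-and-project, and an unbiased gradient estimate drives the step. The toy quadratic, step size, and iteration count below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic: f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b.
n = 5
A = np.diag(np.arange(1.0, n + 1))   # simple SPD matrix
b = np.ones(n)
x_star = np.linalg.solve(A, b)       # exact minimizer, for checking

def sketch_oracle(x, i):
    """One linear measurement of the gradient: its i-th coordinate."""
    return A[i] @ x - b[i]

x = np.zeros(n)
h = np.zeros(n)          # running variance-reduced gradient estimate
alpha = 0.01             # conservative step size (assumption)

for _ in range(50_000):
    i = rng.integers(n)                  # uniform coordinate sketch
    delta = sketch_oracle(x, i) - h[i]
    g = h.copy()
    g[i] += n * delta                    # unbiased: E[g] = grad f(x)
    h[i] += delta                        # sketch-and-project update of h
    x = x - alpha * g
```

Because h converges to the true gradient, the variance of g vanishes near the optimum, which is what allows convergence to the exact minimizer with a constant step size.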


Stochastic Steepest Descent Methods for Linear Systems: Greedy Sampling & Momentum
TLDR
The proposed greedy methods significantly outperform the existing methods on a wide variety of datasets, including random test instances as well as real-world datasets (LIBSVM, and sparse datasets from the Matrix Market collection).
Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods
TLDR
A unified convergence analysis is proposed that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have had different applications, and have been developed separately in various communities.
A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent
TLDR
A unified analysis is introduced for a large family of variants of proximal stochastic gradient descent which so far have required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities.
Stochastic Subspace Descent
We present two stochastic descent algorithms that apply to unconstrained optimization and are particularly efficient when the objective function is slow to evaluate and gradients are not easily…
Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization
TLDR
A unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer is presented, and a minibatch size is determined that not only improves the theoretical total complexity of the methods but also improves their convergence in practice.
Variance Reduced Coordinate Descent with Acceleration: New Method With a Surprising Application to Finite-Sum Problems
TLDR
The ASVRCD method can deal with problems that include a non-separable and non-smooth regularizer while accessing only a random block of partial derivatives in each iteration; it also incorporates Nesterov's momentum, which offers favorable iteration complexity guarantees over both SEGA and SVRCD.
Escaping Saddle Points with Compressed SGD
TLDR
This paper shows that compressed SGD with the RandomK compressor converges to an ε-SOSP with the same number of iterations as uncompressed SGD, while improving the total communication by a factor of Θ̃(√d · ε^(−3/4)), where d is the dimension of the optimization problem.
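The RandomK compressor mentioned here is a standard sparsifier: keep k coordinates chosen uniformly at random and rescale by d/k so the compressed vector is unbiased. A minimal sketch (function name, test vector, and sample count are illustrative assumptions):

```python
import numpy as np

def random_k(v, k, rng):
    """Keep k uniformly random coordinates of v, scaled by d/k (unbiased)."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

# Unbiasedness check: averaging many compressed copies recovers v.
rng = np.random.default_rng(0)
v = np.array([1.0, -2.0, 3.0, 0.5])
mean = np.mean([random_k(v, 2, rng) for _ in range(50_000)], axis=0)
```

Only k of the d coordinates are transmitted per step, which is the source of the communication savings; the d/k rescaling keeps the gradient estimate unbiased at the cost of extra variance.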
Stochastic Subspace Cubic Newton Method
TLDR
It is proved that, as the minibatch size varies, the global convergence rate of SSCN interpolates between the rate of stochastic coordinate descent (CD) and the rate of cubic regularized Newton, thus giving new insights into the connection between first- and second-order methods.
Communication Acceleration of Local Gradient Methods via an Accelerated Primal-Dual Algorithm with Inexact Prox
TLDR
The general results offer new state-of-the-art rates for the class of strongly convex-concave saddle-point problems with bilinear coupling that are characterized by the absence of smoothness in the dual function.
Linearly Converging Error Compensated SGD
TLDR
A unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates is proposed, along with EC-SGD-DIANA, the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.

References

SHOWING 1-10 OF 53 REFERENCES
Stochastic quasi-gradient methods: variance reduction via Jacobian sketching
TLDR
It is proved that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches, and a refined convergence theorem applies to a smaller class of sketches, featuring a novel proof technique based on a stochastic Lyapunov function.
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
TLDR
This work shows that this much older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence for these methods.
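For context, the PL inequality discussed in this reference states that a smooth function f with minimum value f* satisfies, for some μ > 0,

```latex
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu \bigl( f(x) - f^* \bigr) \qquad \text{for all } x,
```

under which gradient descent with step size 1/L on an L-smooth f enjoys the linear rate f(x_k) − f* ≤ (1 − μ/L)^k (f(x_0) − f*), without requiring convexity.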
Stochastic Block BFGS: Squeezing More Curvature out of Data
TLDR
Numerical tests on large-scale logistic regression problems reveal that the proposed novel limited-memory stochastic block BFGS update is more robust and substantially outperforms current state-of-the-art methods.
Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches
TLDR
This paper designs a new importance sampling for minibatch ACD which significantly outperforms the previous state-of-the-art minibatch ACD in practice, and proves a rate that is at most three times worse than the rate of minibatch ACD with uniform sampling, but can be three times better.
Coordinate descent with arbitrary sampling I: algorithms and complexity
TLDR
A complexity analysis of ALPHA is provided, from which complexity bounds for its many variants are deduced as direct corollaries, all matching or improving the best known bounds.
Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling
TLDR
This paper improves the best known running time of accelerated coordinate descent by a factor of up to n, based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter.
Randomized Iterative Methods for Linear Systems
TLDR
A novel, fundamental and surprisingly simple randomized iterative method for solving consistent linear systems is proposed; it allows for a much wider selection of the method's two parameters, which leads to a number of new specific methods.
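One classic instance of this sketch-and-project family is randomized Kaczmarz, where each step projects the iterate onto the solution set of a single sampled row. A minimal sketch (the random consistent system and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent system A x = b, built so the unique solution is x_true.
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

x = np.zeros(5)
for _ in range(5_000):
    i = rng.integers(20)                  # sample one row uniformly
    a = A[i]
    x += (b[i] - a @ x) / (a @ a) * a     # project onto {x : a^T x = b_i}
```

Each update touches only one row of A, yet for a consistent system the iterates converge linearly in expectation to the exact solution.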
First-order methods of smooth convex optimization with inexact oracle
TLDR
It is demonstrated that the superiority of fast gradient methods over the classical ones is no longer absolute when an inexact oracle is used, and it is proved that, contrary to simple gradient schemes, fast gradient methods must necessarily suffer from error accumulation.
Stochastic Dual Ascent for Solving Linear Systems
TLDR
It is proved that primal iterates associated with the dual process converge to the projection exponentially fast in expectation, and the same rate applies to dual function values, primal function values and the duality gap.
SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization
TLDR
Unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all local curvature information contained in the examples, which leads to striking improvements in both theory and practice.