# Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches

@inproceedings{Hanzely2018AcceleratedCD, title={Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches}, author={Filip Hanzely and Peter Richt{\'a}rik}, booktitle={International Conference on Artificial Intelligence and Statistics}, year={2018} }

Accelerated coordinate descent is a widely popular optimization algorithm due to its efficiency on large-dimensional problems. It achieves state-of-the-art complexity on an important class of empirical risk minimization problems. In this paper we design and analyze an accelerated coordinate descent (ACD) method which in each iteration updates a random subset of coordinates according to an arbitrary but fixed probability law, which is a parameter of the method. If all coordinates are updated in…
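The idea of updating a random subset of coordinates drawn from a fixed probability law can be illustrated with a minimal sketch. This is plain, non-accelerated coordinate descent on a small quadratic, not the ACD method of the paper. The independent sampling law, the function names, and the conservative step size `1 / (|S| * A[i][i])` are all assumptions made for illustration.

```python
import random

def coordinate_descent(A, b, probs, iters=2000, seed=0):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive definite A.

    Each iteration draws a random subset S by including coordinate i
    independently with probability probs[i] (one possible sampling law),
    then takes a damped gradient step on the coordinates in S.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        S = [i for i in range(n) if rng.random() < probs[i]]
        if not S:
            continue  # empty subset: nothing to update this round
        # Partial derivatives at the current point, only for coordinates in S.
        grad = {i: sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in S}
        # Assumed step size 1 / (|S| * A_ii): conservative, but it keeps the
        # simultaneous update a descent step for any positive definite A.
        for i in S:
            x[i] -= grad[i] / (len(S) * A[i][i])
    return x

def objective(A, b, x):
    """Evaluate f(x) = 0.5 x^T A x - b^T x."""
    n = len(b)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))

# Small diagonally dominant (hence positive definite) example.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = coordinate_descent(A, b, probs=[0.5, 0.3, 0.8])
```

Changing `probs` changes how often each coordinate is updated. The sketch only shows where the sampling law enters the iteration and makes no claim about the rates analyzed in the paper.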

## 31 Citations

### Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods

- Computer Science
- ArXiv
- 2022

A unified convergence analysis is proposed that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications, and have been developed separately in various communities.

### Stochastic Subspace Descent

- Computer Science, Mathematics
- 2019

We present two stochastic descent algorithms that apply to unconstrained optimization and are particularly efficient when the objective function is slow to evaluate and gradients are not easily…

### SGD: General Analysis and Improved Rates

- Computer Science
- ICML 2019
- 2019

This theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches, and can determine the mini-batch size that optimizes the total complexity.

### Nonconvex Variance Reduced Optimization with Arbitrary Sampling

- Computer Science
- ICML
- 2019

Surprisingly, this approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization.

### A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

- Computer Science
- AISTATS
- 2020

A unified analysis is introduced for a large family of variants of proximal stochastic gradient descent, which so far have required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities.

### On the Worst-Case Analysis of Cyclic Coordinate-Wise Algorithms on Smooth Convex Functions

- Computer Science
- 2022

A new upper bound for cyclic coordinate descent is obtained that outperforms the best available ones by an order of magnitude, and numerical evidence is provided that a standard scheme that provably accelerates random coordinate descent to an $O(1/k^2)$ complexity is actually inefficient when used in a (deterministic) cyclic algorithm.

### Fast Cyclic Coordinate Dual Averaging with Extrapolation for Generalized Variational Inequalities

- Computer Science
- ArXiv
- 2021

CODER is the first cyclic block coordinate method whose convergence rate is independent of the number of blocks, which fills the significant gap between cyclic coordinate methods and randomized ones that remained open for many years.

### Variance Reduced Coordinate Descent with Acceleration: New Method With a Surprising Application to Finite-Sum Problems

- Computer Science, Mathematics
- ICML
- 2020

The ASVRCD method can handle problems that include a non-separable and non-smooth regularizer while accessing only a random block of partial derivatives in each iteration, and it incorporates Nesterov's momentum, which offers favorable iteration complexity guarantees over both SEGA and SVRCD.

### SAGA with Arbitrary Sampling

- Computer Science
- ICML
- 2019

An iteration complexity analysis of the SAGA algorithm is performed; the resulting linear convergence rates match those of the primal-dual method Quartz, for which an arbitrary sampling analysis is available. This is a significant step towards closing the gap in the understanding of the complexity of primal and dual methods for finite-sum problems.

### Convergence Analysis of Block Coordinate Algorithms with Determinantal Sampling

- Mathematics, Computer Science
- AISTATS
- 2020

The convergence rate of the randomized Newton-like method for smooth and convex objectives, which uses random coordinate blocks of a Hessian-over-approximation matrix $\mathbf{M}$ instead of the true Hessian, is analyzed and a fundamental new expectation formula for determinantal point processes is derived.

## References


### Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

- Computer Science
- ICML
- 2016

This paper improves the best known running time of accelerated coordinate descent by a factor of up to $\sqrt{n}$, based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter.
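The sampling rule in this summary can be sketched in a few lines. This is a minimal illustration, assuming the coordinate-wise smoothness parameters are given as a list `L`; the function names are hypothetical and the sketch covers only the sampling step, not the accelerated method itself.

```python
import math
import random

def sqrt_smoothness_probs(L):
    """Return probabilities p_i proportional to sqrt(L_i)."""
    roots = [math.sqrt(Li) for Li in L]
    total = sum(roots)
    return [r / total for r in roots]

def sample_coordinate(probs, rng):
    """Draw one coordinate index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Assumed example values for the smoothness parameters.
L = [1.0, 4.0, 16.0]
probs = sqrt_smoothness_probs(L)  # proportional to 1 : 2 : 4
rng = random.Random(0)
i = sample_coordinate(probs, rng)
```

Compared with sampling proportional to `L` itself, the square root flattens the distribution, so coordinates with small smoothness parameters are still visited reasonably often.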

### When Cyclic Coordinate Descent Outperforms Randomized Coordinate Descent

- Computer Science
- NIPS
- 2017

This paper provides examples and more generally problem classes for which CCD (or CD with any deterministic order) is faster than RCD in terms of asymptotic worst-case convergence and provides lower and upper bounds on the amount of improvement on the rate of CCD relative to RCD.

### Coordinate descent with arbitrary sampling I: algorithms and complexity

- Computer Science
- Optim. Methods Softw.
- 2016

A complexity analysis of ALPHA is provided, from which complexity bounds for its many variants are deduced as direct corollaries, all matching or improving the best known bounds.

### On optimal probabilities in stochastic coordinate descent methods

- Computer Science
- Optim. Lett.
- 2016

A new parallel coordinate descent method is proposed and analyzed, in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen using an arbitrary probability law, which is the first method of this type.

### Accelerated, Parallel, and Proximal Coordinate Descent

- Computer Science, Mathematics
- SIAM J. Optim.
- 2015

A new randomized coordinate descent method is proposed for minimizing the sum of convex functions, each of which depends on only a small number of coordinates. It can be implemented without full-dimensional vector operations, which are the major bottleneck of accelerated coordinate descent.

### Approximate Steepest Coordinate Descent

- Computer Science
- ICML
- 2017

A new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization that can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates.

### Parallel coordinate descent methods for big data optimization

- Computer Science, Mathematics
- Math. Program.
- 2016

In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex…

### Coordinate descent algorithms

- Computer Science
- Math. Program.
- 2015

A problem structure that arises frequently in machine learning applications is described, and efficient implementations of accelerated coordinate descent algorithms are shown to be possible for problems of this type.

### Safe Adaptive Importance Sampling

- Computer Science
- NIPS
- 2017

It is shown that coordinate descent and stochastic gradient descent can enjoy a significant speed-up under the novel sampling scheme, which can be computed efficiently, in many applications at negligible extra cost.

### Coordinate descent with arbitrary sampling II: expected separable overapproximation

- Mathematics
- Optim. Methods Softw.
- 2016

This paper develops a systematic technique for deriving expected separable overapproximation inequalities for a large class of functions and for arbitrary samplings, and demonstrates that one can recover existing ESO results using this general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.