Corpus ID: 51874893

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning

@inproceedings{Mishchenko2018ADP,
  title={A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning},
  author={Konstantin Mishchenko and Franck Iutzeler and J{\'e}r{\^o}me Malick and Massih-Reza Amini},
  booktitle={ICML},
  year={2018}
}
Distributed learning aims at computing high-quality models by training over scattered data. This covers a diversity of scenarios, including computer clusters or mobile agents. One of the main challenges is then to deal with heterogeneous machines and unreliable communications. In this setting, we propose and analyze a flexible asynchronous optimization algorithm for solving nonsmooth learning problems. Unlike most existing methods, our algorithm is adjustable to various levels of communication… 
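
The abstract describes a composite, nonsmooth learning objective minimized with delayed proximal-gradient updates in a master/worker-style setting. The display below is a generic formulation of that problem class in our own notation (the symbols f_i, g, γ, and the delays d_i^k are ours, not necessarily the paper's); it is a sketch of the setting, not the paper's exact algorithm.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Composite distributed learning problem: M workers hold smooth local losses f_i,
% and g is a convex, possibly nonsmooth regularizer (e.g. the l1 norm).
\[
  \min_{x \in \mathbb{R}^d} \; \frac{1}{M} \sum_{i=1}^{M} f_i(x) + g(x)
\]
% A generic delayed proximal-gradient step: the gradient of each f_i may be
% evaluated at an outdated iterate, with d_i^k the delay of worker i at step k.
\[
  x^{k+1} = \operatorname{prox}_{\gamma g}\!\Big( x^{k} - \frac{\gamma}{M}
            \sum_{i=1}^{M} \nabla f_i\big(x^{\,k - d_i^k}\big) \Big),
  \qquad
  \operatorname{prox}_{\gamma g}(y) = \operatorname*{arg\,min}_{z}
  \Big\{ g(z) + \tfrac{1}{2\gamma}\,\lVert z - y \rVert^{2} \Big\}
\]
\end{document}
```

(A small runnable simulation of this asynchronous master/worker setting is sketched after the list of citing works below.)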

Citations

A Distributed Flexible Delay-Tolerant Proximal Gradient Algorithm
TLDR
This work develops and analyzes an asynchronous algorithm for distributed convex optimization when the objective is the sum of smooth functions, local to each worker, and a non-smooth function; it proves that the algorithm converges linearly in the strongly convex case and provides convergence guarantees in the non-strongly convex case.
Delay-adaptive step-sizes for asynchronous learning
TLDR
This paper develops general convergence results for delay-adaptive asynchronous iterations and specializes them to proximal incremental gradient descent and block-coordinate descent algorithms; it demonstrates how delays can be measured on-line, presents delay-adaptive step-size policies, and illustrates their theoretical and practical advantages over the state of the art.
Optimal convergence rates of totally asynchronous optimization
TLDR
This paper derives explicit convergence rates for the proximal incremental aggregated gradient (PIAG) and the asynchronous block-coordinate descent (Async-BCD) methods under a specific model of total asynchrony, and shows that the derived rates are order-optimal.
Asynchronous Distributed Learning with Sparse Communications and Identification
TLDR
An asynchronous optimization algorithm for distributed learning is proposed that efficiently reduces the communications between the master and worker machines by randomly sparsifying the local updates; it identifies near-optimal sparsity patterns, so that all communications eventually become sparse.
Distributed Learning with Sparse Communications by Identification
TLDR
It is shown that this algorithm converges linearly in the strongly convex case and identifies optimal sparse solutions; an automatic dimension reduction is also proposed, sparsifying all exchanges between coordinator and workers.
Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks
TLDR
A framework is developed for choosing the step-size and momentum parameters of these algorithms so as to optimize performance by systematically trading off bias, variance, robustness to gradient noise, and dependence on network effects.
DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate
TLDR
A distributed asynchronous quasi-Newton algorithm with superlinear convergence guarantees is developed, believed to be the first distributed asynchronous algorithm with such guarantees.
Advances in Asynchronous Parallel and Distributed Optimization
TLDR
This article reviews recent developments in the design and analysis of asynchronous optimization methods, covering both centralized methods, where all processors update a master copy of the optimization variables, and decentralized methods, where each processor maintains a local copy of the variables. The analysis provides insights into how the degree of asynchrony impacts convergence rates, especially in stochastic optimization methods.
Sparse Asynchronous Distributed Learning
TLDR
An asynchronous distributed learning algorithm where parameter updates are performed by worker machines simultaneously on local sub-parts of the training data, achieving a better convergence rate and far fewer parameter exchanges between the master and the worker machines than without the sparsification technique.
L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method
TLDR
This work proposes L-DQN, a distributed algorithm for solving empirical risk minimization problems under the master/worker communication model; it is the first distributed quasi-Newton method with provable global linear convergence guarantees in the asynchronous setting, where delays between nodes are present.
…
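
The works above (and the paper itself) revolve around asynchronous master/worker proximal-gradient updates, in which the master combines possibly stale worker gradients and applies a proximal step. Below is a minimal, self-contained Python simulation of that pattern on a synthetic Lasso problem, using threads and random sleeps in place of real machines and network delays. It is an illustrative sketch only: it does not reproduce the algorithm of the paper above or of any cited work, and every name in it (worker, local_gradient, prox_l1, the chosen step size, etc.) is hypothetical.

```python
"""Toy simulation of an asynchronous master/worker proximal-gradient loop.
Illustrative sketch only -- not the method of any specific paper."""
import threading
import time
import random

import numpy as np

# ---- synthetic data split across M workers --------------------------------
rng = np.random.default_rng(0)
M, n_per_worker, d = 4, 50, 20
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 0.5]                       # sparse ground truth
A = [rng.standard_normal((n_per_worker, d)) for _ in range(M)]
b = [Ai @ x_true + 0.01 * rng.standard_normal(n_per_worker) for Ai in A]

lam = 0.1          # l1 regularization weight (hypothetical choice)
gamma = 0.05       # conservative step size for this synthetic problem
n_updates = 400    # total number of master updates

def local_gradient(i, x):
    """Gradient of worker i's local least-squares loss at point x."""
    return A[i].T @ (A[i] @ x - b[i]) / n_per_worker

def prox_l1(y, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

# ---- shared state guarded by a lock ----------------------------------------
lock = threading.Lock()
x = np.zeros(d)                                    # current master iterate
grads = [local_gradient(i, x) for i in range(M)]   # last gradient received from each worker
updates_left = n_updates

def worker(i):
    """Worker i repeatedly reads the (possibly stale) iterate, computes its
    local gradient after a random delay, and sends it to the master, which
    immediately applies a proximal-gradient step using whatever gradients it
    currently holds -- gradients from other workers may be stale."""
    global updates_left, x
    while True:
        with lock:
            if updates_left <= 0:
                return
            x_read = x.copy()                      # snapshot of the master iterate
        time.sleep(random.uniform(0.0, 0.005))     # simulated compute/communication delay
        g = local_gradient(i, x_read)              # gradient at an outdated point
        with lock:
            if updates_left <= 0:
                return
            grads[i] = g
            avg_grad = sum(grads) / M
            x = prox_l1(x - gamma * avg_grad, gamma * lam)   # delayed proximal-gradient step
            updates_left -= 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(M)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("estimated x[:5]:", np.round(x[:5], 3))
print("true      x[:5]:", x_true[:5])
```

A lock serializes master updates so the sketch isolates the effect of gradient staleness rather than racy writes; in a real deployment the master would receive gradients over the network instead of through shared memory.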

References

SHOWING 1-10 OF 23 REFERENCES
Asynchronous Distributed ADMM for Consensus Optimization
TLDR
An asynchronous ADMM algorithm is proposed that uses two conditions to control the asynchrony, a partial barrier and a bounded delay, and it achieves faster convergence than its synchronous counterpart in terms of wall-clock time.
Adding vs. Averaging in Distributed Primal-Dual Optimization
TLDR
A novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization is presented, which allows for the additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity
TLDR
It is shown that the accelerated distributed stochastic variance reduced gradient algorithm achieves a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper.
An accelerated communication-efficient primal-dual optimization framework for structured machine learning
TLDR
An accelerated variant of CoCoA+ is proposed and shown to possess an improved convergence rate in terms of reducing suboptimality, and the results of numerical experiments show that acceleration can lead to significant performance gains.
Federated Optimization: Distributed Machine Learning for On-Device Intelligence
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes.
Asynchronous Coordinate Descent under More Realistic Assumptions
TLDR
It is argued that the assumptions made in existing analyses of asynchronous-parallel block coordinate descent either fail to hold in practice or imply less efficient implementations, and convergence is proved without the independence assumption.
ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates
TLDR
Theoretically, it is shown that if the nonexpansive operator $T$ has a fixed point, then with probability one, ARock generates a sequence that converges to a fixed point of $T$.
A delayed proximal gradient method with linear convergence rate
TLDR
This paper derives an explicit expression that quantifies how the convergence rate depends on objective function properties and algorithm parameters such as step-size and the maximum delay, and reveals the trade-off between convergence speed and residual error.
Distributed optimization with arbitrary local solvers
TLDR
This work presents a framework for distributed optimization that allows the flexibility of arbitrary solvers to be used locally on each machine while maintaining competitive performance against other state-of-the-art special-purpose distributed methods.
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
TLDR
ProxASAGA, a fully asynchronous sparse method inspired by SAGA (a variance-reduced incremental gradient algorithm), is proposed; it achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term.
…