Corpus ID: 220041939

Randomized Block-Diagonal Preconditioning for Parallel Learning

Celestine Mendler-Dünner, Aurélien Lucchi
We study preconditioned gradient-based optimization methods where the preconditioning matrix has block-diagonal form. This structural constraint has the advantage that the update computation can be parallelized across multiple independent tasks. Our main contribution is to demonstrate that the convergence of these methods can be significantly improved by a randomization technique that corresponds to repartitioning coordinates across tasks during the optimization procedure. We provide… 
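The repartitioning idea can be illustrated on a simple quadratic. The following is a minimal sketch, not the paper's implementation: the function name and the quadratic test problem are my own, and the sketch assumes exact block solves. At each step the coordinates are randomly reassigned to blocks, and each block independently applies the inverse of its diagonal Hessian block to its slice of the gradient:

```python
import numpy as np

def rbd_precond_gd(A, b, num_blocks=4, iters=50, seed=0):
    """Hypothetical sketch: gradient descent on f(x) = 0.5*x'Ax - b'x with a
    block-diagonal preconditioner whose blocks are re-drawn from a random
    coordinate partition at every iteration."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        grad = A @ x - b
        perm = rng.permutation(n)  # randomize the coordinate-to-block assignment
        for block in np.array_split(perm, num_blocks):
            # each block solve is independent, so this loop could run in parallel
            H = A[np.ix_(block, block)]  # diagonal block of the Hessian
            x[block] -= np.linalg.solve(H, grad[block])
    return x
```

Since every block uses the gradient from the start of the iteration, the block solves share no state and can be distributed across workers; the random repartition is what distinguishes this from a fixed block-Jacobi scheme.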
Communication-Efficient Distributed Optimization with Quantized Preconditioners
This work designs and analyzes the first communication-efficient distributed variants of preconditioned gradient descent for Generalized Linear Models and of Newton's method. It relies on a new technique for quantizing both the preconditioner and the descent direction at each step of the algorithms, while controlling their convergence rate.
Distributed block-diagonal approximation methods for regularized empirical risk minimization
This paper proposes a flexible framework for distributed ERM training through solving the dual problem. The framework provides a unified description and comparison of existing methods and is versatile enough to apply to many large-scale machine learning problems, including classification, regression, and structured prediction.
Communication-Efficient Parallel Block Minimization for Kernel Machines
This paper develops a parallel block minimization framework for solving kernel machines, including kernel SVM and kernel logistic regression, and proves a global linear convergence rate for the proposed method with a wide class of subproblem solvers.
Parallel coordinate descent methods for big data optimization
In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function.
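A parallel coordinate descent step of this flavor can be sketched as follows. This is a hedged illustration under my own simplifying assumptions, not the paper's method: I use the conservative step-size factor `beta = tau` (the number of coordinates updated per iteration), which is always safe for a quadratic but looser than the separability-dependent factors the parallel coordinate descent literature derives:

```python
import numpy as np

def parallel_cd(A, b, tau=4, iters=200, seed=0):
    """Hypothetical sketch of parallel randomized coordinate descent on
    f(x) = 0.5*x'Ax - b'x: each iteration samples tau coordinates and updates
    them simultaneously with per-coordinate step sizes 1 / (beta * A_ii)."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    beta = tau  # conservative damping factor; assumed, not from the paper
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)  # random coordinate subset
        grad_S = A[S] @ x - b[S]                    # partial gradient on S
        # all tau coordinate updates are independent -> parallelizable
        x[S] -= grad_S / (beta * np.diag(A)[S])
    return x
```

The damping by `beta` is what keeps simultaneous updates from overshooting: updating many coupled coordinates at once with full coordinate steps can diverge, so the step is shrunk in proportion to the degree of parallelism.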
A Distributed Second-Order Algorithm You Can Trust
A new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers and dynamically adapts the auxiliary model to compensate for modeling errors is presented.
Adding vs. Averaging in Distributed Primal-Dual Optimization
A novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization, which allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
This work presents a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing, and extends the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso.
A distributed block coordinate descent method for training $l_1$ regularized linear classifiers
The main idea of the algorithm is to optimize a block of many variables on the actual objective function within each computing node. This increases the computational cost per step so that it is matched with the communication cost, and decreases the number of outer iterations, yielding a faster overall method.
Sub-sampled Cubic Regularization for Non-convex Optimization
This work provides a sampling scheme that gives sufficiently accurate gradient and Hessian approximations to retain the strong global and local convergence guarantees of cubically regularized methods, and is the first work that gives global convergence guarantees for a sub-sampled variant of cubic regularization on non-convex functions.
DiSCO: Distributed Optimization for Self-Concordant Empirical Loss
The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method, and its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions are analyzed.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
This work shows, through novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking. It presents an update scheme called HOGWILD! that allows processors to access shared memory with the possibility of overwriting each other's work.
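The lock-free scheme can be sketched in a few lines. This is an illustrative toy, assuming a least-squares objective of my choosing and Python threads (which serialize under the GIL, so it demonstrates the semantics rather than true parallel speedup); the point is that no lock guards the shared parameter vector:

```python
import threading

import numpy as np

def hogwild_sgd(A, b, n_threads=4, epochs=50, lr=0.01, seed=0):
    """Hedged sketch of a HOGWILD!-style scheme for 0.5*||Ax - b||^2:
    worker threads run SGD on a shared parameter vector with no locks,
    tolerating occasional lost updates by design."""
    n, d = A.shape
    x = np.zeros(d)  # shared memory, written by all threads without locking

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs * n // n_threads):
            i = rng.integers(n)
            residual = A[i] @ x - b[i]
            # lock-free in-place update; a race here may overwrite another
            # thread's work, which the HOGWILD! analysis tolerates
            x[:] -= lr * residual * A[i]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

On a consistent linear system the per-sample gradients vanish at the solution, so lost updates near convergence are tiny and the iterates still settle close to the minimizer despite the races.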