On Linear Learning with Manycore Processors

@inproceedings{Wszola2019OnLL,
  title={On Linear Learning with Manycore Processors},
  author={Eliza Wszola and Celestine Mendler-D{\"u}nner and Martin Jaggi and Markus P{\"u}schel},
  booktitle={2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)},
  year={2019},
  pages={184--194}
}
A new generation of manycore processors is on the rise that offers dozens and more cores on a chip and, in a sense, fuses host processor and accelerator. In this paper we target the efficient training of generalized linear models on these machines. We propose a novel approach for achieving parallelism which we call Heterogeneous Tasks on Homogeneous Cores (HTHC). It divides the problem into multiple fundamentally different tasks, which themselves are parallelized. For evaluation, we design a… 

Advances in Asynchronous Parallel and Distributed Optimization

This article reviews recent developments in the design and analysis of asynchronous optimization methods, covering both centralized methods, where all processors update a master copy of the optimization variables, and decentralized methods, where each processor maintains a local copy of the variables. The analysis provides insights into how the degree of asynchrony impacts convergence rates, especially in stochastic optimization methods.

References

Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems

The scheme allows compute accelerators such as GPUs and FPGAs to be employed efficiently for training large-scale machine learning models when the training data exceeds their memory capacity, and it adapts to any system's memory hierarchy in terms of size and processing speed.

cuDNN: Efficient Primitives for Deep Learning

A library similar in intent to BLAS, with optimized routines for deep learning workloads. It contains routines for GPUs and, like BLAS, could be implemented for other platforms.

Scaling Deep Learning on GPU and Knights Landing clusters

  • Yang You, A. Buluç, J. Demmel
  • Computer Science
    SC17: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2017
A redesign of four efficient algorithms for HPC systems to improve EASGD’s poor scaling on clusters, which are faster than existing counterpart methods (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons.

Linear support vector machines via dual cached loops

StreamSVM, the first algorithm for training linear Support Vector Machines (SVMs) which takes advantage of these properties by integrating caching with optimization by performing updates in the dual, thus obviating the need to rebalance frequently visited examples.

Fast Quantized Arithmetic on x86: Trading Compute for Data Movement

Clover, a new library for efficient computation using low-precision data, provides mathematical routines required by fundamental methods in optimization and sparse recovery, and supports data formats from 4-bit quantized to 32-bit IEEE-754 on current Intel processors.

MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures

Support Vector Machines (SVMs) have been widely used in data-mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In…

Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

A performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path is provided.

Partitioning Compute Units in CNN Acceleration for Statistical Memory Traffic Shaping

A strategy of partitioning the compute units is proposed in which the cores within each partition process a batch of input data synchronously to maximize data reuse, while different partitions run asynchronously. This can yield an 8.0 percent performance gain on a commercial 64-core processor when running ResNet-50.

ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning

The ZipML framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias, and it enables an FPGA prototype that is up to 6.5× faster than an implementation using full 32-bit precision.

Robust Large-Scale Machine Learning in the Cloud

A new scalable coordinate descent algorithm for generalized linear models whose convergence behavior is always the same, regardless of how much SCD is scaled out and regardless of the computing environment, which makes SCD highly robust and enables it to scale to massive datasets on low-cost commodity servers.