CLBlast: A Tuned OpenCL BLAS Library

@article{Nugteren2018CLBlastAT,
  title={CLBlast: A Tuned OpenCL BLAS Library},
  author={Cedric Nugteren},
  journal={Proceedings of the International Workshop on OpenCL},
  year={2018}
}
  • C. Nugteren
  • Published 12 May 2017
  • Computer Science
  • Proceedings of the International Workshop on OpenCL
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it… 
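
As a concrete illustration of how the GEMM routine described above is invoked, the following minimal C sketch uses CLBlast's plain C bindings (clblast_c.h). The device selection, matrix sizes, and fill values are illustrative only, and most error handling is omitted; this is a sketch of the calling pattern, not a complete application.

/* Minimal sketch: single-precision GEMM (C = alpha*A*B + beta*C) through
 * CLBlast's C bindings (clblast_c.h). Sizes and values are illustrative;
 * most error handling is omitted. */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <clblast_c.h>

int main(void) {
  const size_t m = 128, n = 64, k = 256;

  /* Boilerplate OpenCL setup: first platform, first device. */
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

  /* Host matrices in row-major layout. */
  float *a = (float *)malloc(m * k * sizeof(float));
  float *b = (float *)malloc(k * n * sizeof(float));
  float *c = (float *)malloc(m * n * sizeof(float));
  for (size_t i = 0; i < m * k; ++i) a[i] = 1.0f;
  for (size_t i = 0; i < k * n; ++i) b[i] = 2.0f;
  for (size_t i = 0; i < m * n; ++i) c[i] = 0.0f;

  /* Device buffers, filled from the host arrays. */
  cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, m * k * sizeof(float), NULL, NULL);
  cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, k * n * sizeof(float), NULL, NULL);
  cl_mem c_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, m * n * sizeof(float), NULL, NULL);
  clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0, m * k * sizeof(float), a, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0, k * n * sizeof(float), b, 0, NULL, NULL);
  clEnqueueWriteBuffer(queue, c_buf, CL_TRUE, 0, m * n * sizeof(float), c, 0, NULL, NULL);

  /* The GEMM call itself: row-major, no transposes, leading dimensions equal
   * to the row lengths of A (k), B (n) and C (n). */
  cl_event event = NULL;
  CLBlastStatusCode status = CLBlastSgemm(
      CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
      m, n, k,
      1.0f, a_buf, 0, k,
            b_buf, 0, n,
      0.0f, c_buf, 0, n,
      &queue, &event);

  if (status == CLBlastSuccess) {
    clWaitForEvents(1, &event);
    clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0, m * n * sizeof(float), c, 0, NULL, NULL);
    printf("c[0] = %.1f (expected %.1f)\n", c[0], 2.0f * (float)k);
  } else {
    fprintf(stderr, "CLBlastSgemm failed with status %d\n", (int)status);
  }

  if (event != NULL) clReleaseEvent(event);
  clReleaseMemObject(a_buf); clReleaseMemObject(b_buf); clReleaseMemObject(c_buf);
  clReleaseCommandQueue(queue); clReleaseContext(context);
  free(a); free(b); free(c);
  return 0;
}

The same pattern applies to the other routines: the caller owns the OpenCL context, queue, and buffers, and the library only enqueues work on the queue it is given.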

Citations

Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
TLDR
The implementation of a parametric, tile-based TRSM routine for SYCL-BLAS is presented, employing a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS, an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.
ClPy: A NumPy-Compatible Library Accelerated with OpenCL
TLDR
ClPy, a Python library that supports OpenCL through a simple NumPy-like interface, is developed together with an OpenCL extension of the Chainer machine learning framework, and is shown to achieve reasonable performance on several machine learning applications.
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
TLDR
This work analyzes different machine learning techniques and predictive models to accelerate the convolution operator and GEMM, addresses the problem of dataset generation, and studies the performance, accuracy, and generalization ability of the models.
A model-driven approach for a new generation of adaptive libraries
TLDR
A new adaptive framework for data-driven applications is presented, which uses a predictive model trained on synthetic and real datasets to select optimal algorithmic parameters; its effectiveness is demonstrated on a BLAS library.
Triton: an intermediate language and compiler for tiled neural network computations
TLDR
Triton is presented, a language and compiler centered around the concept of tile, i.e., statically shaped multi-dimensional sub-arrays for expressing tensor programs in terms of operations on parametric tile variables and a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.
Kernel Tuner: A search-optimizing GPU code auto-tuner
Accelerating winograd convolutions using symbolic computation and meta-programming
TLDR
This paper proposes a novel method to optimize Winograd convolutions based on symbolic computation and shows that the optimization technique can effectively exploit repetitive patterns, enabling it to reduce the number of arithmetic operations by up to 62% without compromising numerical stability.
Performance portability through machine learning guided kernel selection in SYCL libraries
Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels
TLDR
This paper shows that, by writing highly parameterized kernels for matrix multiplies and convolutions, the authors achieve performance competitive with vendor implementations across different architectures.
Portable Parallel Performance via Multi-Dimensional Homomorphisms
TLDR
Several state-of-the-art approaches aim at providing portability of performance, but they are mostly limited to restricted combinations of: 1) applications, 2) hardware architectures, and/or 3) input sizes.

References

Showing 1-10 of 22 references
cuDNN: Efficient Primitives for Deep Learning
TLDR
A library similar in intent to BLAS is presented, with optimized routines for deep learning workloads; it currently contains routines for GPUs but, like the BLAS library, could be implemented for other platforms.
CLTune: A Generic Auto-Tuner for OpenCL Kernels
  • C. Nugteren, V. Codreanu
  • Computer Science
    2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip
  • 2015
TLDR
This work presents CLTune, an auto-tuner for OpenCL kernels that evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations, supporting multiple search strategies including simulated annealing and particle swarm optimisation. A generic sketch of this kind of parameter sweep is given after the reference list below.
A Note on Auto-tuning GEMM for GPUs
TLDR
GPU GEMM auto-tuning optimization techniques are described that allow the development of high-performance dense linear algebra to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas.
Input-aware auto-tuning of compute-bound HPC kernels
TLDR
ISAAC, an input-aware auto-tuning framework for matrix multiplications and convolutions, is presented; it uses predictive modeling techniques to drive highly parameterized PTX code templates towards not only hardware-specific but also application-specific kernels.
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
TLDR
An auto-tuning system with a code generator for fast matrix-multiplication kernels in OpenCL is developed; it shows higher performance than the highly tuned vendor library, while the implementations on NVIDIA GPUs are comparable.
Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability
  • Thomas L. Falch, A. Elster
  • Computer Science
    2015 IEEE International Parallel and Distributed Processing Symposium Workshop
  • 2015
TLDR
This paper uses machine learning-based auto-tuning to address poor performance portability in heterogeneous computing, building an artificial neural network-based model that achieves a mean relative error as low as 6.1% and finds configurations as little as 1.3% worse than the global minimum.
MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
TLDR
This work proposes and designs batched BLAS (Basic Linear Algebra Subroutines) routines, Level-2 GEMV and Level-3 GEMM, to solve linear algebra operations on many small matrices, and applies the batched methodology to a real-world hydrodynamics application.
A Comparison of Potential Interfaces for Batched BLAS Computations
TLDR
This work discusses many possible ways in which the BLAS standard can be extended for batch operations, giving benefits and criticisms of each, along with a number of experiments designed to determine how the API may affect performance on modern HPC systems.
Performance, Design, and Autotuning of Batched GEMM for GPUs
TLDR
The general matrix-matrix multiplication (GEMM) kernel should be well designed and tuned to handle small sizes and to maintain high performance for realistic test cases found in higher-level LAPACK routines and scientific computing applications in general.
CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices
TLDR
The TensorFlow framework is used as a case study, and the ability to run unary, binary, and reduction TensorFlow and Eigen kernels with no modification to the original CUDA source code is demonstrated.
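
Several of the references above (CLTune, the note on auto-tuning GEMM, ISAAC) revolve around searching a space of kernel parameters for the fastest configuration. Independent of any particular library, the core pattern is a sweep that times each candidate configuration and keeps the best. The sketch below illustrates that pattern only; the Config struct, the candidate values, and time_configuration() are invented for this illustration and are not part of any library discussed on this page.

/* Generic illustration of exhaustive auto-tuning over a small parameter
 * space. In a real tuner, time_configuration() would compile the kernel
 * with the given parameters and measure its runtime on the device. */
#include <stdio.h>
#include <float.h>

typedef struct { int tile_m; int tile_n; int vector_width; } Config;

/* Placeholder cost model standing in for an actual kernel launch and timing. */
static double time_configuration(const Config *c) {
  return 1.0 / (double)(c->tile_m * c->tile_n) + 0.01 * (double)c->vector_width;
}

int main(void) {
  const int tiles[] = {8, 16, 32};
  const int widths[] = {1, 2, 4};
  Config best = {0, 0, 0};
  double best_time = DBL_MAX;

  /* Exhaustive sweep: time every combination and keep the fastest one. */
  for (size_t i = 0; i < sizeof(tiles) / sizeof(tiles[0]); ++i)
    for (size_t j = 0; j < sizeof(tiles) / sizeof(tiles[0]); ++j)
      for (size_t v = 0; v < sizeof(widths) / sizeof(widths[0]); ++v) {
        Config cand = {tiles[i], tiles[j], widths[v]};
        double t = time_configuration(&cand);
        if (t < best_time) { best_time = t; best = cand; }
      }

  printf("best: tile_m=%d tile_n=%d vector_width=%d (cost %.4f)\n",
         best.tile_m, best.tile_n, best.vector_width, best_time);
  return 0;
}

Real tuners such as CLTune replace the exhaustive loop with search strategies like simulated annealing or particle swarm optimisation and measure actual kernel runtimes on the device rather than a synthetic cost model.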