CLBlast: A Tuned OpenCL BLAS Library
@article{Nugteren2018CLBlastAT, title={CLBlast: A Tuned OpenCL BLAS Library}, author={Cedric Nugteren}, journal={Proceedings of the International Workshop on OpenCL}, year={2018} }
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it…
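As a concrete illustration of the intended usage, the sketch below shows a single-precision GEMM call through CLBlast's C API (OpenCL boilerplate kept minimal and error checking omitted for brevity; exact names should be verified against the installed clblast_c.h):

```cpp
#include <clblast_c.h>  // CLBlast's plain-C API (also usable from C++)
#include <vector>

int main() {
  const size_t m = 128, n = 64, k = 256;

  // Minimal OpenCL setup: first platform, first device.
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

  // Row-major host matrices: A (m x k), B (k x n), C (m x n).
  std::vector<float> host_a(m * k, 1.0f), host_b(k * n, 2.0f), host_c(m * n, 0.0f);
  cl_mem a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            host_a.size() * sizeof(float), host_a.data(), nullptr);
  cl_mem b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            host_b.size() * sizeof(float), host_b.data(), nullptr);
  cl_mem c = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            host_c.size() * sizeof(float), host_c.data(), nullptr);

  // C = 1.0 * A * B + 0.0 * C; CLBlast picks kernel parameters tuned for
  // this device, falling back to defaults if the device was never tuned.
  cl_event event = nullptr;
  CLBlastSgemm(CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
               m, n, k,
               1.0f, a, 0, k,  // A, offset 0, leading dimension k
               b, 0, n,        // B, offset 0, leading dimension n
               0.0f, c, 0, n,  // C, offset 0, leading dimension n
               &queue, &event);
  clWaitForEvents(1, &event);
  clEnqueueReadBuffer(queue, c, CL_TRUE, 0, host_c.size() * sizeof(float),
                      host_c.data(), 0, nullptr, nullptr);

  clReleaseMemObject(a); clReleaseMemObject(b); clReleaseMemObject(c);
  clReleaseCommandQueue(queue); clReleaseContext(context);
  return 0;
}
```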
50 Citations
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
- Computer Science · IWOCL
- 2021
The implementation of a parametric tile-based TRSM routine for SYCL-BLAS is presented, employing a formulation that reduces most of the work to calls into the highly optimized GEMM routine already provided by SYCL-BLAS, an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.
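The core idea, reducing most of a triangular solve to calls into an already-optimized GEMM, can be shown with a simplified scalar sketch in plain C++ (a stand-in for the paper's tiled SYCL kernels; in SYCL-BLAS the trailing update in step 2 would invoke the tuned GEMM routine rather than a loop nest):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Blocked TRSM sketch: solve L * X = B for X, with L lower triangular
// (n x n) and B (n x m), both row-major; X overwrites B.
void trsm_blocked(const std::vector<double>& L, std::vector<double>& B,
                  std::size_t n, std::size_t m, std::size_t nb) {
  for (std::size_t k0 = 0; k0 < n; k0 += nb) {
    const std::size_t k1 = std::min(k0 + nb, n);

    // Step 1: forward substitution on the small diagonal block
    // L[k0:k1, k0:k1], solving for the block of rows X[k0:k1, :].
    for (std::size_t i = k0; i < k1; ++i) {
      for (std::size_t j = 0; j < m; ++j) {
        double s = B[i * m + j];
        for (std::size_t p = k0; p < i; ++p) s -= L[i * n + p] * B[p * m + j];
        B[i * m + j] = s / L[i * n + i];
      }
    }

    // Step 2: GEMM-shaped trailing update,
    // B[k1:n, :] -= L[k1:n, k0:k1] * X[k0:k1, :].
    // This is the bulk of the work and maps directly onto a BLAS GEMM.
    for (std::size_t i = k1; i < n; ++i) {
      for (std::size_t j = 0; j < m; ++j) {
        double s = 0.0;
        for (std::size_t p = k0; p < k1; ++p) s += L[i * n + p] * B[p * m + j];
        B[i * m + j] -= s;
      }
    }
  }
}
```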
ClPy: A NumPy-Compatible Library Accelerated with OpenCL
- Computer Science · 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- 2019
ClPy, a Python library that supports OpenCL through a simple NumPy-like interface, is developed together with an OpenCL extension of the Chainer machine learning framework, and is shown to achieve reasonable performance on several machine learning applications.
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
- Computer Science · ACM Trans. Archit. Code Optim.
- 2021
This work analyzes different machine learning techniques and predictive models for accelerating the convolution operator and GEMM, addresses the problem of dataset generation, and studies the performance, accuracy, and generalization ability of the resulting models.
A model-driven approach for a new generation of adaptive libraries
- Computer Science · arXiv
- 2018
A new adaptive framework for data-driven applications is presented, which uses a predictive model trained on synthetic and real datasets to select optimal algorithmic parameters; its effectiveness is demonstrated on a BLAS library.
Triton: an intermediate language and compiler for tiled neural network computations
- Computer Science · MAPL@PLDI
- 2019
Triton is presented: a language and compiler centered around the concept of a tile, i.e., statically shaped multi-dimensional sub-arrays, for expressing tensor programs as operations on parametric tile variables, together with a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.
Kernel Tuner: A search-optimizing GPU code auto-tuner
- Computer Science · Future Gener. Comput. Syst.
- 2019
Accelerating winograd convolutions using symbolic computation and meta-programming
- Computer Science · EuroSys
- 2020
This paper proposes a novel method to optimize Winograd convolutions based on symbolic computation and shows that the optimization technique can effectively exploit repetitive patterns, enabling it to reduce the number of arithmetic operations by up to 62% without compromising numerical stability.
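For context, the arithmetic savings in Winograd convolution come from minimal filtering algorithms such as F(2,3), which produces two outputs of a 3-tap filter from four inputs using four multiplications instead of six. Below is the textbook formulation (not the paper's symbolic method):

```cpp
#include <array>

// Winograd F(2,3): two outputs of a 1-D correlation with a 3-tap filter g
// from four inputs d, using 4 element-wise multiplications instead of 6.
// y0 = d0*g0 + d1*g1 + d2*g2,  y1 = d1*g0 + d2*g1 + d3*g2.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
  const float m1 = (d[0] - d[2]) * g[0];
  const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
  const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
  const float m4 = (d[1] - d[3]) * g[2];
  return {m1 + m2 + m3, m2 - m3 - m4};
}
```

In practice the filter-side factors (the halved sums over g) are transformed once per filter, so they do not count towards the per-tile multiplication budget.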
Performance portability through machine learning guided kernel selection in SYCL libraries
- Computer Science · Parallel Comput.
- 2021
Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels
- Computer Science · arXiv
- 2019
This paper shows that by writing highly parameterized kernels for matrix multiplies and convolutions the authors achieve performance competitive with vendor implementations across different architectures.
Portable Parallel Performance via Multi-Dimensional Homomorphisms
- Computer Science
- 2018
Several state-of-the-art approaches aim to provide performance portability, but they are mostly limited to restricted combinations of: 1) applications, 2) hardware architectures, and/or 3) input sizes.
References
Showing 1-10 of 22 references
cuDNN: Efficient Primitives for Deep Learning
- Computer Science · arXiv
- 2014
cuDNN is presented: a library similar in intent to BLAS, with optimized routines for deep learning workloads. It currently contains routines for GPUs, but, like BLAS, it could be implemented for other platforms.
CLTune: A Generic Auto-Tuner for OpenCL Kernels
- Computer Science · 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip
- 2015
This work presents CLTune, an auto-tuner for OpenCL kernels that evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations, and supports multiple search strategies including simulated annealing and particle swarm optimisation.
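A typical tuning session, following the published CLTune C++ API, looks roughly like the sketch below (the kernel file gemm_tiled.opencl and its two tile-size parameters are hypothetical placeholders):

```cpp
#include <cltune.h>
#include <vector>

int main() {
  const size_t m = 1024, n = 1024;
  std::vector<float> input(m * n, 1.0f), output(m * n, 0.0f);

  // Tuner bound to device 0 of OpenCL platform 0.
  cltune::Tuner tuner(0, 0);

  // Register the kernel with its base global and local thread sizes.
  const auto id = tuner.AddKernel({"gemm_tiled.opencl"}, "gemm_tiled",
                                  {m, n}, {1, 1});

  // User-defined search space: each combination of values becomes one
  // candidate configuration, passed to the kernel as -D definitions.
  // Invalid combinations (e.g. too many work-items) are skipped.
  tuner.AddParameter(id, "TILE_SIZE_X", {8, 16, 32});
  tuner.AddParameter(id, "TILE_SIZE_Y", {8, 16, 32});
  tuner.MulLocalSize(id, {"TILE_SIZE_X", "TILE_SIZE_Y"});

  // Arguments used when benchmarking each configuration.
  tuner.AddArgumentInput(input);
  tuner.AddArgumentOutput(output);

  // Exhaustive search by default; CLTune also implements strategies such
  // as simulated annealing and particle swarm optimisation.
  tuner.Tune();
  return 0;
}
```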
A Note on Auto-tuning GEMM for GPUs
- Computer Science · ICCS
- 2009
This note describes GPU GEMM auto-tuning techniques that allow high-performance dense linear algebra development to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas.
Input-aware auto-tuning of compute-bound HPC kernels
- Computer Science · SC
- 2017
ISAAC, an input-aware auto-tuning framework for matrix multiplications and convolutions, is presented; it uses predictive modeling techniques to drive highly parameterized PTX code templates towards kernels that are not only hardware-specific but also application-specific.
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
- Computer Science · 2012 SC Companion: High Performance Computing, Networking Storage and Analysis
- 2012
This paper develops an auto-tuning system with a code generator for fast matrix-multiply kernels in OpenCL that shows higher performance than the highly tuned vendor library, while the implementations on NVIDIA GPUs are comparable to it.
Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability
- Computer Science · 2015 IEEE International Parallel and Distributed Processing Symposium Workshop
- 2015
This paper uses machine-learning-based auto-tuning to address poor performance portability in heterogeneous computing, building an artificial neural network model that achieves a mean relative error as low as 6.1% and finds configurations as little as 1.3% worse than the global minimum.
MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
- Computer Science
- 2016
This work proposes and designs batched BLAS (Basic Linear Algebra Subroutines) routines, Level-2 GEMV and Level-3 GEMM, to perform linear algebra operations on many small matrices, and applies the batched methodology to a real-world hydrodynamics application.
A Comparison of Potential Interfaces for Batched BLAS Computations
- Computer Science
- 2016
This work discusses many possible ways in which the BLAS standard can be extended for batch operations, giving benefits and criticisms of each, along with a number of experiments designed to determine how the API may affect performance on modern HPC systems.
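The design space can be summarised by two recurring interface styles, sketched below as hypothetical C++ prototypes (cuBLAS's cublasSgemmBatched and cublasSgemmStridedBatched follow the same two patterns):

```cpp
#include <cstddef>

// Style 1: pointer-array ("group") interface. Each of the batch_count
// problems may have its own dimensions and pointers, so everything is
// passed as arrays. Maximally flexible, but the pointer and size arrays
// add setup cost and memory traffic.
void sgemm_batched(const std::size_t* m, const std::size_t* n,
                   const std::size_t* k, const float* alpha,
                   const float* const* a, const std::size_t* lda,
                   const float* const* b, const std::size_t* ldb,
                   const float* beta, float* const* c,
                   const std::size_t* ldc, std::size_t batch_count);

// Style 2: strided interface. All problems share one size, and matrix i
// lives at a fixed element stride from matrix 0. Cheap to set up and
// GPU-friendly, but restricted to uniform batches.
void sgemm_strided_batched(std::size_t m, std::size_t n, std::size_t k,
                           float alpha,
                           const float* a, std::size_t lda, std::size_t stride_a,
                           const float* b, std::size_t ldb, std::size_t stride_b,
                           float beta,
                           float* c, std::size_t ldc, std::size_t stride_c,
                           std::size_t batch_count);
```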
Performance, Design, and Autotuning of Batched GEMM for GPUs
- Computer Science · ISC
- 2016
The general matrix-matrix multiplication (GEMM) kernel should be carefully designed and tuned to handle small sizes and to maintain high performance for the realistic test cases found in higher-level LAPACK routines and in scientific computing applications in general.
CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices
- Computer Science · IWOCL
- 2017
The TensorFlow framework is used as a case study, demonstrating the ability to run unary, binary, and reduction TensorFlow and Eigen kernels with no modification to the original CUDA source code.