CLTune: A Generic Auto-Tuner for OpenCL Kernels
@article{Nugteren2015CLTuneAG,
  title   = {CLTune: A Generic Auto-Tuner for OpenCL Kernels},
  author  = {Cedric Nugteren and Valeriu Codreanu},
  journal = {2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip},
  year    = {2015},
  pages   = {195-202}
}
This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable parameters to explore manually, 2) when performance portability across OpenCL devices is desired, or 3) when the…
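The core idea above — exhaustively evaluating a user-defined search space of parameter-value combinations and keeping the fastest — can be sketched in a few lines. This is a minimal illustrative sketch in Python, not CLTune's actual C++ API; the parameter names and the cost function standing in for a real OpenCL kernel launch are hypothetical.

```python
import itertools

def autotune(run_kernel, search_space):
    """Brute-force auto-tuning sketch: evaluate every parameter-value
    combination in the user-defined search space and return the fastest
    configuration. (CLTune also offers smarter search strategies, such
    as simulated annealing, for spaces too large to enumerate.)"""
    best_config, best_time = None, float("inf")
    names = list(search_space)
    for values in itertools.product(*search_space.values()):
        config = dict(zip(names, values))
        elapsed = run_kernel(config)  # would compile + time a kernel
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time

# Hypothetical cost model standing in for a real OpenCL kernel launch:
# assume larger workgroups and wider vectors are faster on this device.
def fake_kernel_time(config):
    return 1.0 / (config["WORKGROUP_SIZE"] * config["VECTOR_WIDTH"])

space = {"WORKGROUP_SIZE": [32, 64, 128, 256],
         "VECTOR_WIDTH": [1, 2, 4],
         "UNROLL_FACTOR": [1, 2, 4, 8]}
best, t = autotune(fake_kernel_time, space)
```

In a real tuner, `run_kernel` would compile the kernel with the chosen parameters injected as preprocessor defines and measure its execution time on the device.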
76 Citations
Kernel Tuner: A search-optimizing GPU code auto-tuner
- Computer Science · Future Gener. Comput. Syst.
- 2019
Autotuning OpenCL Workgroup Size for Stencil Patterns
- Computer Science · ArXiv
- 2015
This work proposes the use of machine learning-enabled autotuning to automatically predict workgroup sizes for stencil patterns on CPUs and multi-GPUs, and evaluates the effectiveness of each technique in an empirical study of 429 combinations of architecture, kernel, and dataset.
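The prediction idea described in that entry — learning from past (architecture, kernel, dataset) measurements to pick a workgroup size for an unseen case — can be illustrated with a toy nearest-neighbour model. This is a hedged sketch of the general technique, not the paper's actual method or features; the feature tuples below are made up.

```python
def predict_workgroup_size(features, training_data):
    """1-nearest-neighbour sketch of ML-based workgroup-size prediction.
    `training_data` maps a feature tuple (e.g. derived from device,
    kernel, and dataset properties) to the best-known workgroup size
    measured for that combination."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_data, key=lambda f: dist(f, features))
    return training_data[nearest]

# Hypothetical features: (compute units, kernel arithmetic intensity).
train = {(16, 2.0): 64, (16, 8.0): 128, (80, 8.0): 256}
pred = predict_workgroup_size((72, 7.5), train)  # nearest is (80, 8.0)
```

Real systems use richer features and stronger models (decision trees, neural networks), but the workflow is the same: train on measured configurations, then predict for unseen ones instead of searching from scratch.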
A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit
- Computer Science · Future Gener. Comput. Syst.
- 2020
CLBlast: A Tuned OpenCL BLAS Library
- Computer Science · IWOCL
- 2018
CLBlast is an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra on a wide variety of devices; it can also combine multiple operations in a single batched routine, significantly accelerating smaller problems.
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
- Computer Science · ACM Trans. Archit. Code Optim.
- 2021
This work analyzes different machine learning techniques and predictive models to accelerate the convolution operator and GEMM, and addresses the problem of dataset generation, and studies the performance, accuracy, and generalization ability of the models.
Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications
- Computer Science · Concurr. Comput. Pract. Exp.
- 2017
This paper uses machine learning‐based auto‐tuning to address poor performance portability in heterogeneous computing, and achieves a mean relative error as low as 3.8% and is able to find solutions on average only 0.29% slower than the best configuration in some cases.
Autotuning of OpenCL Kernels with Global Optimizations
- Computer Science · ANDARE '17
- 2017
A Kernel Tuning Toolkit (KTT) is introduced that implements inter-kernel global optimizations, allowing parameters affecting multiple kernels, or even the host code, to be tuned; this extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations.
Exploiting historical data: Pruning autotuning spaces and estimating the number of tuning steps
- Computer Science · Concurr. Comput. Pract. Exp.
- 2020
It is demonstrated that it is possible to use historical data to reliably predict the number of tuning steps that are necessary to find a well‐performing configuration and to reduce the size of the tuning space.
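The pruning idea summarized above — using measurements from previously tuned hardware to shrink the space searched on a new device — can be sketched as a simple ranking filter. This is an illustrative sketch of the general technique, not the paper's actual algorithm; the configurations and times below are hypothetical.

```python
def prune_space(configs, history, keep_fraction=0.1):
    """History-based pruning sketch: rank configurations by their
    measured times on previously tuned hardware and keep only the
    fraction that performed well there, so the search on a new device
    starts from a much smaller space. `history` maps each
    configuration to its measured time in seconds."""
    ranked = sorted(configs, key=lambda c: history[c])
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Hypothetical workgroup sizes and their times on an earlier device.
history = {32: 4.0, 64: 2.0, 128: 1.5, 256: 3.0}
pruned = prune_space(list(history), history, keep_fraction=0.5)
```

The trade-off is the usual one: aggressive pruning risks discarding the configuration that would have been optimal on the new hardware, so `keep_fraction` must be chosen from how well performance transfers across devices.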
ATF: A Generic Auto-Tuning Framework
- Computer Science · 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
- 2017
The Auto-Tuning Framework is described: a novel, generic approach for automatic program optimization that chooses the most suitable values of program parameters, such as the number of parallel threads and tile sizes; its interface is arguably simpler than the interfaces of current auto-tuning frameworks.
ATF: A Generic Auto-Tuning Framework
- Computer Science · HPDC
- 2018
We describe the Auto-Tuning Framework (ATF) -- a simple-to-use, generic framework for automatic program optimization by choosing the most suitable values of program parameters, such as number of…
References
Showing 1-10 of 23 references
Auto-tuning a high-level language targeted to GPU codes
- Computer Science · 2012 Innovative Parallel Computing (InPar)
- 2012
This work performs auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
- Computer Science · Parallel Comput.
- 2012
A Note on Auto-tuning GEMM for GPUs
- Computer Science · ICCS
- 2009
Some GPU GEMM auto-tuning optimization techniques that allow the development of high performance dense linear algebra to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas are described.
Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems
- Computer Science
- 2012
This paper describes how to make SkePU tunable, by adding the mechanism of execution plans that can configure a skeleton so that, at run time, the predicted best suitable resource and platform is chosen automatically, depending on operand data sizes.
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
- Computer Science · 2012 SC Companion: High Performance Computing, Networking Storage and Analysis
- 2012
This paper develops an auto-tuning system with a code generator for fast matrix-multiply kernels in OpenCL that shows higher performance than the highly tuned vendor library, while the implementations on NVIDIA GPUs are comparable.
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
- Computer Science · 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
- 2010
This paper has developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.
MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs
- Computer Science · Journal of Computer Science and Technology
- 2013
This paper proposes an automatic performance tuning framework for FFT on various OpenCL GPUs, and implements a high performance library named MPFFT based on this framework, which substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs.
Convolution engine: balancing efficiency & flexibility in specialized computing
- Computer Science · ISCA
- 2013
The Convolution Engine, specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications, is presented and it is demonstrated that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel.
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs
- Computer Science · Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
- 2013
An approach to estimate GPU applications' performance upper bound based on algorithm analysis and assembly code level benchmarking and how to use native assembly language directly in the CUDA runtime source code is presented.
A script-based autotuning compiler system to generate high-performance CUDA code
- Computer Science · TACO
- 2013
A Transformation Strategy Generator, a meta-optimizer, generates a set of transformation recipes (descriptions of the mapping from sequential code to parallel CUDA code) that together comprise a search space of possible implementations.