CLTune: A Generic Auto-Tuner for OpenCL Kernels

@article{Nugteren2015CLTuneAG,
  title={CLTune: A Generic Auto-Tuner for OpenCL Kernels},
  author={Cedric Nugteren and Valeriu Codreanu},
  journal={2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip},
  year={2015},
  pages={195-202}
}
  • C. Nugteren, V. Codreanu
  • Published 23 September 2015
  • Computer Science
  • 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip
This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable parameters to explore manually, 2) when performance portability across OpenCL devices is desired, or 3) when the… 
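To make the search-space idea concrete, the sketch below shows how a host program might define such a parameter space with CLTune's C++ interface. The call names (cltune::Tuner, AddKernel, AddParameter, MulLocalSize, AddArgumentInput/Output, Tune, PrintToScreen) follow my recollection of the CLTune samples; the header name, kernel file, sizes, and parameter values are placeholders, so treat this as an illustrative sketch rather than verified code.

  #include <vector>
  #include "cltune.h"  // CLTune public header (name assumed from the project)

  int main() {
    const size_t kSizeX = 2048, kSizeY = 2048;
    std::vector<float> input(kSizeX * kSizeY, 1.0f);
    std::vector<float> output(kSizeX * kSizeY, 0.0f);

    cltune::Tuner tuner(0, 0);  // OpenCL platform 0, device 0

    // Register the kernel file, kernel name, and base global/local thread sizes
    auto id = tuner.AddKernel({"copy_kernel.opencl"}, "copy", {kSizeX, kSizeY}, {1, 1});

    // User-defined search space: every combination of these values is a candidate
    tuner.AddParameter(id, "WORKGROUP_SIZE_X", {8, 16, 32});
    tuner.AddParameter(id, "WORKGROUP_SIZE_Y", {8, 16, 32});
    tuner.AddParameter(id, "UNROLL_FACTOR", {1, 2, 4, 8});

    // Tie the workgroup-size parameters to the OpenCL local thread configuration
    tuner.MulLocalSize(id, {"WORKGROUP_SIZE_X", "WORKGROUP_SIZE_Y"});

    // Kernel arguments (parameter values reach the kernel as preprocessor defines)
    tuner.AddArgumentInput(input);
    tuner.AddArgumentOutput(output);

    // Compile and benchmark the configurations, then report the fastest one
    tuner.Tune();
    tuner.PrintToScreen();
    return 0;
  }

CLTune then compiles and benchmarks each configuration in this 3 x 3 x 4 space, or a subset of it when one of its search strategies (such as simulated annealing or particle swarm optimization, as described in the paper) is selected, and reports the best-performing one.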
Kernel Tuner: A search-optimizing GPU code auto-tuner
Autotuning OpenCL Workgroup Size for Stencil Patterns
TLDR
This work proposes the use of machine learning-enabled autotuning to automatically predict workgroup sizes for stencil patterns on CPUs and multi-GPUs, and evaluates the effectiveness of each technique in an empirical study of 429 combinations of architecture, kernel, and dataset.
A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit
CLBlast: A Tuned OpenCL BLAS Library
TLDR
CLBlast is an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra on a wide variety of devices; it can also combine multiple operations in a single batched routine, significantly accelerating smaller problems.
On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
TLDR
This work analyzes different machine learning techniques and predictive models to accelerate the convolution operator and GEMM, and addresses the problem of dataset generation, and studies the performance, accuracy, and generalization ability of the models.
Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications
TLDR
This paper uses machine learning-based auto-tuning to address poor performance portability in heterogeneous computing, achieving a mean relative error as low as 3.8% and, in some cases, finding solutions on average only 0.29% slower than the best configuration.
Autotuning of OpenCL Kernels with Global Optimizations
TLDR
A Kernel Tuning Toolkit (KTT) is introduced that implements inter-kernel global optimizations, allowing parameters affecting multiple kernels or even the host code to be tuned, extending state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations.
Exploiting historical data: Pruning autotuning spaces and estimating the number of tuning steps
TLDR
It is demonstrated that it is possible to use historical data to reliably predict the number of tuning steps that are necessary to find a well‐performing configuration and to reduce the size of the tuning space.
ATF: A Generic Auto-Tuning Framework
  • Ari Rasch, Michael Haidl, S. Gorlatch
  • Computer Science
    2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2017
TLDR
The Auto-Tuning Framework (ATF) is described, a novel generic approach to automatic program optimization that chooses the most suitable values of program parameters, such as the number of parallel threads and tile sizes; its interface is arguably simpler than the interfaces of current auto-tuning frameworks.

References

Showing 1-10 of 23 references
Auto-tuning a high-level language targeted to GPU codes
TLDR
This work performs auto-tuning over a large optimization space for GPU kernels, focusing on loop permutation, loop unrolling, tiling, and the choice of which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
A Note on Auto-tuning GEMM for GPUs
TLDR
GPU GEMM auto-tuning optimization techniques are described that allow the development of high-performance dense linear algebra to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas.
Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems
TLDR
This paper describes how to make SkePU tunable by adding a mechanism of execution plans that configure a skeleton so that, at run time, the resource and platform predicted to be most suitable are chosen automatically, depending on operand data sizes.
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
TLDR
This paper develops an auto-tuning system with a code generator for fast matrix-multiply kernels in OpenCL; the generated kernels show higher performance than the highly tuned vendor library, while the implementations on NVIDIA GPUs are comparable to it.
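As a generic illustration of what such parameterized kernels look like (not code from the cited paper), the OpenCL C sketch below exposes the tile size as a preprocessor parameter, so an auto-tuner or code generator can rebuild the kernel with a different -D TILE_SIZE=... value per configuration. It assumes square N x N matrices, N divisible by TILE_SIZE, and a TILE_SIZE x TILE_SIZE work-group.

  #ifndef TILE_SIZE
  #define TILE_SIZE 16  // tunable: typical search values are 8, 16, 32
  #endif

  __kernel void matmul(const int N,
                       __global const float* A,
                       __global const float* B,
                       __global float* C) {
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);

    __local float tileA[TILE_SIZE][TILE_SIZE];
    __local float tileB[TILE_SIZE][TILE_SIZE];

    float acc = 0.0f;
    for (int t = 0; t < N / TILE_SIZE; ++t) {
      // Cooperative load of one tile of A and one tile of B into local memory
      tileA[ly][lx] = A[row * N + t * TILE_SIZE + lx];
      tileB[ly][lx] = B[(t * TILE_SIZE + ly) * N + col];
      barrier(CLK_LOCAL_MEM_FENCE);

      #pragma unroll
      for (int k = 0; k < TILE_SIZE; ++k) {
        acc += tileA[ly][k] * tileB[k][lx];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
  }

An auto-tuner then only has to sweep TILE_SIZE (and any other exposed macros), rebuild the kernel for each value, and time the resulting binaries.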
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
  • Seyong Lee, R. Eigenmann
  • Computer Science
    2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
TLDR
This paper has developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.
MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs
TLDR
This paper proposes an automatic performance tuning framework for FFT on various OpenCL GPUs, and implements a high-performance library named MPFFT based on this framework, which substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to the CUFFT library on NVIDIA GPUs.
Convolution engine: balancing efficiency & flexibility in specialized computing
TLDR
The Convolution Engine, specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications, is presented and it is demonstrated that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel.
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs
  • Junjie Lai, André Seznec
  • Computer Science
    Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
  • 2013
TLDR
An approach to estimating the performance upper bound of GPU applications based on algorithm analysis and assembly-code-level benchmarking is presented, along with a method for using native assembly language directly in CUDA runtime source code.
A script-based autotuning compiler system to generate high-performance CUDA code
TLDR
A Transformation Strategy Generator is presented, a meta-optimizer that generates a set of transformation recipes: descriptions of the mapping from sequential code to parallel CUDA code that together comprise a search space of possible implementations.