• Publications
A detailed GPU cache model based on reuse distance theory
This work extends reuse distance to GPUs by modelling the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, including conditional and non-uniform latencies, cache associativity, miss-status holding-registers, and warp divergence.
CLTune: A Generic Auto-Tuner for OpenCL Kernels
This work presents CLTune, an auto-tuner for OpenCL kernels that evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations, and supports multiple search strategies including simulated annealing and particle swarm optimisation.
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs
This paper presents two novel histogramming methods, both of which achieve higher performance and predictability than existing methods and are guaranteed to be fully data independent.
CLBlast: A Tuned OpenCL BLAS Library
CLBlast is an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It can combine multiple operations in a single batched routine, accelerating smaller problems significantly.
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
A new classification of algorithms is used in a new source-to-source compiler based on the algorithmic skeletons technique. It is demonstrated that the presented compiler requires few modifications to the original sequential source code, generates readable code for further fine-tuning, and delivers superior performance compared to other tools for a set of 8 image processing kernels.
EL-GAN: Embedding Loss Driven Generative Adversarial Networks for Lane Detection
This work proposes EL-GAN, a GAN framework to mitigate the discussed problem using an embedding loss, and uses the TuSimple lane marking challenge to demonstrate that with this proposed framework it is viable to overcome the inherent anomalies of posing lane detection as a semantic segmentation problem.
This article presents a technique to fully automatically generate efficient and readable code for parallel processors (with a focus on GPUs), made possible by combining algorithmic skeletons, traditional compilation, and “algorithmic species,” a classification of program code.
The boat hull model: adapting the roofline model to enable performance prediction for parallel computing
This work modifies the roofline model to include class information, enabling architectural choice through performance prediction prior to the development of architecture-specific code, and shows for 6 example algorithms that performance is predicted accurately without requiring code to be available.
Skeleton-based automatic parallelization of image processing algorithms for GPUs
This paper presents a technique to automatically parallelize and map sequential code onto a GPU without the need for code annotations, using domain-specific skeletons and a finer-grained classification of algorithms.
PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs
We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming.