• Publications
A detailed GPU cache model based on reuse distance theory
This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding-registers, and (5) warp divergence.
CLTune: A Generic Auto-Tuner for OpenCL Kernels
This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parameter-value combinations.
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs
We present two novel histogramming methods, both of which achieve higher performance and predictability than existing methods and are guaranteed to be fully data independent.
CLBlast: A Tuned OpenCL BLAS Library
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices.
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
We present a new source-to-source compiler, which is based on the algorithmic skeletons technique.
EL-GAN: Embedding Loss Driven Generative Adversarial Networks for Lane Detection
We propose EL-GAN: a GAN framework that uses an embedding loss to mitigate the inherent anomalies of posing lane detection as a semantic segmentation problem.
The boat hull model: adapting the roofline model to enable performance prediction for parallel computing
We use an algorithm classification to predict performance prior to algorithm implementation.
Skeleton-based automatic parallelization of image processing algorithms for GPUs
We present a technique to automatically parallelize and map sequential code on a GPU, without the need for code-annotations.
PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs
We motivate the design and implementation of a platform-neutral compute intermediate language for productive and performance-portable accelerator programming.
Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
This paper presents a technique to fully automatically generate efficient and readable code for parallel processors.