A detailed GPU cache model based on reuse distance theory
TLDR
This work extends reuse distance theory to GPUs by modelling the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, as well as conditional and non-uniform latencies, cache associativity, miss-status holding registers, and warp divergence.
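As an illustrative aside, the classical sequential notion of reuse distance that this work builds on can be sketched as follows; this is a plain reference version, not the paper's GPU model, which additionally covers warps, threadblocks, and MSHRs:

```python
def reuse_distances(trace):
    """Compute the reuse distance of each access in a memory trace.

    The reuse distance of an access is the number of *distinct*
    addresses touched since the previous access to the same address
    (None for a cold miss). In a fully-associative LRU cache with
    capacity A lines, an access hits iff its reuse distance is < A.
    """
    last_seen = {}          # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to `addr`
            window = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(None)  # cold (compulsory) miss
        last_seen[addr] = i
    return distances

# 'a' is reused after touching {b, c}, so its reuse distance is 2
print(reuse_distances(['a', 'b', 'c', 'a', 'b']))
# -> [None, None, None, 2, 2]
```

The histogram of these distances, compared against cache capacity, predicts hit rates; the paper's contribution is making this prediction accurate for the massively threaded GPU setting.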
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs
TLDR
This paper presents two novel histogramming methods, both achieving higher performance and predictability than existing methods while guaranteeing full data independence.
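A minimal sketch of one common way to make histogramming data independent: per-thread private histograms followed by a reduction, so no two threads ever contend on the same counter and per-element work is constant. The `histogram_private` helper is an illustration, not one of the paper's two methods:

```python
import numpy as np

def histogram_private(data, num_bins, num_threads=4):
    """Histogram via per-thread private histograms plus a reduction.

    Each 'thread' accumulates into its own sub-histogram, so updates
    are conflict-free regardless of the value distribution; a final
    reduction sums the sub-histograms. Constant work per element is
    what makes the running time independent of the input values.
    """
    sub = np.zeros((num_threads, num_bins), dtype=np.int64)
    chunks = np.array_split(np.asarray(data), num_threads)
    for t, chunk in enumerate(chunks):   # each iteration = one thread
        for v in chunk:
            sub[t, v] += 1               # private, conflict-free update
    return sub.sum(axis=0)               # reduction step

print(histogram_private([0, 1, 1, 3, 3, 3, 2, 0], num_bins=4))
# -> [2 2 1 3]
```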
Adaptive and transparent cache bypassing for GPUs
TLDR
This paper proposes a novel compile-time framework for adaptive and transparent cache bypassing on GPUs that uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints.
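A hypothetical sketch of the footprint-based reasoning behind bypass-degree control: if the runtime footprint exceeds cache capacity, route the excess fraction of accesses around the cache. The heuristic below is an assumption for illustration, not the paper's actual compile-time framework:

```python
def bypass_degree(footprint_bytes, cache_bytes):
    """Fraction of accesses to bypass so the cached working set fits.

    Hypothetical heuristic: when the footprint exceeds cache capacity,
    bypass the excess fraction of accesses; otherwise cache everything.
    """
    if footprint_bytes <= cache_bytes:
        return 0.0
    return 1.0 - cache_bytes / footprint_bytes

print(bypass_degree(48 * 1024, 16 * 1024))  # footprint 3x the cache -> bypass ~2/3
```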
Fine-Grained Synchronizations and Dataflow Programming on GPUs
TLDR
This paper proposes a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs, demonstrates its performance, and applies it to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies.
Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs
TLDR
The results show that optimizing the GPU code for speed achieves a speed-up of about 10× over naive GPU code, and that the implementation with constant processing time is faster for about 20% of the images.
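For context, the voting step of the classic Hough line transform that such GPU implementations accelerate can be sketched as follows (a plain reference version, not the optimized GPU code from the paper):

```python
import math

def hough_lines(points, width, height, num_thetas=180):
    """Vote in a (theta, rho) accumulator for the Hough line transform.

    Every edge point votes for all lines passing through it, using the
    normal form rho = x*cos(theta) + y*sin(theta). Peaks in the
    accumulator correspond to detected lines.
    """
    diag = int(math.hypot(width, height))
    acc = [[0] * (2 * diag + 1) for _ in range(num_thetas)]
    for x, y in points:
        for t in range(num_thetas):
            theta = math.pi * t / num_thetas
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[t][rho + diag] += 1  # shift rho so the index is non-negative
    return acc

# Points on the horizontal line y = 2 all vote for (theta = 90 deg, rho = 2)
acc = hough_lines([(0, 2), (1, 2), (2, 2), (3, 2)], width=4, height=4)
```

The trade-off the paper explores stems from this voting step: data-dependent scattered writes into the accumulator make naive GPU versions fast on average but unpredictable.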
GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU
TLDR
This work describes a transformation to merge categories which enables GPU-Vote to have a single implementation for all voting algorithms, and gives an accurate and intuitive performance prediction model for the generalized GPU voting algorithm.
Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs
TLDR
This paper explores the use of configurable bit-vector and bitwise XOR-based hash functions to evenly distribute memory addresses of the access patterns over the memory banks, reducing the number of bank conflicts.
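A minimal sketch of a bitwise XOR-based bank mapping and the kind of strided pattern it helps; `xor_bank` is an illustrative stand-in, not the configurable hardware hash studied in the paper:

```python
def xor_bank(addr, num_banks=32):
    """Map an address to a bank by XOR-folding its index bits.

    Instead of using the low log2(num_banks) bits directly, XOR them
    with the next group of bits; power-of-two strides that would pile
    onto a single bank are then spread across all banks.
    """
    bits = num_banks.bit_length() - 1       # log2(num_banks)
    low = addr % num_banks
    high = (addr >> bits) % num_banks
    return low ^ high

# A stride-32 access pattern hits a single bank under the plain modulo
# mapping, but the XOR hash spreads it over all 32 banks.
addrs = [i * 32 for i in range(32)]
print(sorted({a % 32 for a in addrs}))      # plain mapping: only bank 0
print(sorted({xor_bank(a) for a in addrs})) # XOR hash: all 32 banks
```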
Future of GPGPU micro-architectural parameters
TLDR
This work identifies and discusses trade-offs for three GPU architecture parameters: active thread count, compute-memory ratio, and cluster and warp sizing, and proposes changes to improve GPU design, keeping in mind trends such as dark silicon and the increasing popularity of GPGPU architectures.
A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
TLDR
It is concluded that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler, and that automatically re-ordering threads for better locality can improve the programmability of multi-threaded processors.
Compile-time GPU memory access optimizations
TLDR
To describe these optimizations, a new notation for the parallel execution of GPU programs is introduced; an implementation of the optimizations shows that performance improvements of up to 40× are possible.