Publications
CLTune: A Generic Auto-Tuner for OpenCL Kernels
TLDR
This work presents CLTune, an auto-tuner for OpenCL kernels that evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations, and supports multiple search strategies including simulated annealing and particle swarm optimisation.
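The search strategies named above all walk a discrete space of parameter-value combinations. As a rough illustration, here is a minimal, self-contained C++ sketch of simulated annealing over a toy two-parameter space; the parameter values and the Cost() function are hypothetical stand-ins for compiling and timing an actual OpenCL kernel, and none of this is CLTune's real API.

// Minimal simulated-annealing sketch over a discrete tuning space.
// All names (kSpace, Cost, ...) are illustrative; a real tuner would
// compile and time an OpenCL kernel instead of calling Cost().
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy search space: each parameter has a list of allowed values.
static const std::vector<std::vector<int>> kSpace = {
    {32, 64, 128, 256},  // e.g. work-group size (hypothetical)
    {1, 2, 4, 8},        // e.g. vector width (hypothetical)
};

// Stand-in for "run the kernel with this configuration and time it".
static double Cost(const std::vector<int>& cfg) {
  return std::fabs(cfg[0] - 128.0) + 10.0 * std::fabs(cfg[1] - 4.0);
}

int main() {
  std::mt19937 rng(42);
  // Start from a random configuration (one value index per parameter).
  std::vector<size_t> idx(kSpace.size());
  for (size_t p = 0; p < kSpace.size(); ++p)
    idx[p] = std::uniform_int_distribution<size_t>(0, kSpace[p].size() - 1)(rng);

  auto config = [](const std::vector<size_t>& ix) {
    std::vector<int> cfg;
    for (size_t p = 0; p < ix.size(); ++p) cfg.push_back(kSpace[p][ix[p]]);
    return cfg;
  };

  double best = Cost(config(idx)), current = best;
  std::vector<size_t> best_idx = idx;
  std::uniform_real_distribution<double> uni(0.0, 1.0);

  // Geometric cooling schedule.
  for (double temp = 10.0; temp > 0.01; temp *= 0.95) {
    // Neighbour: re-draw the value of one randomly chosen parameter.
    std::vector<size_t> next = idx;
    size_t p = std::uniform_int_distribution<size_t>(0, kSpace.size() - 1)(rng);
    next[p] = std::uniform_int_distribution<size_t>(0, kSpace[p].size() - 1)(rng);

    double cost = Cost(config(next));
    // Accept better moves always, worse moves with Boltzmann probability.
    if (cost < current || uni(rng) < std::exp((current - cost) / temp)) {
      idx = next;
      current = cost;
      if (cost < best) { best = cost; best_idx = idx; }
    }
  }
  std::printf("best cost %.1f at {%d, %d}\n", best,
              kSpace[0][best_idx[0]], kSpace[1][best_idx[1]]);
}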
CUBu: Universal Real-Time Bundling for Large Graphs
TLDR
Fully GPU-based, CUBu bundles graphs of up to a million edges at interactive framerates, over 50 times faster than comparable state-of-the-art methods, and offers simple, intuitive control of the bundling parameters.
GPU-ASIFT: A fast fully affine-invariant feature extraction algorithm
TLDR
A CUDA version of the algorithm is presented that is up to 70 times faster than the original implementation, while keeping accuracy close to that of the reference ASIFT.
Comparative study between deep learning and bag of visual words for wild-animal recognition
TLDR
This paper develops two variants of the bag of visual words (BOW and HOG-BOW), examines the use of grayscale and color information as well as different spatial pooling approaches, and modifies existing deep CNN architectures, AlexNet and GoogLeNet.
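For readers unfamiliar with the bag-of-visual-words model, the sketch below shows its core step: assigning each local descriptor to the nearest word of a learned codebook (typically built with k-means) and accumulating a normalized histogram. It is a generic, self-contained C++ illustration with toy data, not the paper's HOG-BOW pipeline, and it omits the spatial pooling the paper examines.

// Generic bag-of-visual-words histogram (illustrative only): assign each
// local descriptor to its nearest codebook word, then L1-normalize.
#include <cstdio>
#include <limits>
#include <vector>

using Desc = std::vector<float>;

// Squared Euclidean distance between two descriptors.
static float Dist2(const Desc& a, const Desc& b) {
  float d = 0.0f;
  for (size_t k = 0; k < a.size(); ++k) {
    float t = a[k] - b[k];
    d += t * t;
  }
  return d;
}

// One histogram bin per codebook word.
std::vector<float> BowHistogram(const std::vector<Desc>& descriptors,
                                const std::vector<Desc>& codebook) {
  std::vector<float> hist(codebook.size(), 0.0f);
  for (const Desc& d : descriptors) {
    size_t best = 0;
    float best_d = std::numeric_limits<float>::max();
    for (size_t w = 0; w < codebook.size(); ++w) {
      float dist = Dist2(d, codebook[w]);
      if (dist < best_d) { best_d = dist; best = w; }
    }
    hist[best] += 1.0f;  // vote for the nearest visual word
  }
  for (float& h : hist) h /= static_cast<float>(descriptors.size());
  return hist;
}

int main() {
  // Toy 2-D descriptors and a 2-word codebook.
  std::vector<Desc> descs = {{0.1f, 0.2f}, {0.9f, 0.8f}, {0.2f, 0.1f}};
  std::vector<Desc> words = {{0.0f, 0.0f}, {1.0f, 1.0f}};
  auto h = BowHistogram(descs, words);
  std::printf("histogram: %.2f %.2f\n", h[0], h[1]);  // 0.67 0.33
}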
Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train
TLDR
The challenges and novel solutions needed to train ResNet-50 in this large-scale environment are described, and the novel Collapsed Ensemble (CE) technique is introduced, which achieves 77.5% top-1 accuracy, similar to that of a ResNet-152, while training an unmodified ResNet-50 topology for the same fixed training budget.
Performance gain from data and control dependency elimination in embedded processors
TLDR
By removing data and control dependencies within a processor, and thus the extra hardware required to maintain overall coherence, a noticeable increase in performance (up to 450%) is obtained, along with a decrease in chip size and power consumption.
Parallel centerline extraction on the GPU
Evaluating automatically parallelized versions of the support vector machine
TLDR
This work develops a directive-based approach that converts a gradient-ascent-based training algorithm for the CPU into an efficient graphics processing unit (GPU) implementation, showing a significant speed-up compared to the CPU and OpenACC versions.
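Directive-based parallelization, as evaluated here and in the following entry, annotates ordinary loop nests with pragmas and leaves the loop bodies untouched. The OpenACC matrix-vector product below is a generic example of the style (compiled, for instance, with nvc++ -acc); it is not code from the paper, and without OpenACC support the pragmas are simply ignored and the loops run serially.

// Generic OpenACC example: offload a dense matrix-vector product to the
// GPU with directives. The copyin/copyout clauses describe data movement.
#include <cstdio>
#include <vector>

void matvec(const float* A, const float* x, float* y, int n) {
  #pragma acc parallel loop copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
  for (int i = 0; i < n; ++i) {
    float sum = 0.0f;
    #pragma acc loop reduction(+:sum)
    for (int j = 0; j < n; ++j)
      sum += A[i * n + j] * x[j];
    y[i] = sum;
  }
}

int main() {
  const int n = 4;
  std::vector<float> A(n * n, 1.0f), x(n, 2.0f), y(n, 0.0f);
  matvec(A.data(), x.data(), y.data(), n);
  std::printf("y[0] = %.1f\n", y[0]);  // 4 * 1.0 * 2.0 = 8.0
}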
Evaluation of Autoparallelization Toolkits for Commodity GPUs
In this paper we evaluate the performance of the OpenACC and Mint toolkits against C and CUDA implementations of the standard PolyBench test suite. Our analysis reveals that performance is similar in …