• Publications
A practical automatic polyhedral parallelizer and locality optimizer
TLDR
An automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, implemented as a tool that automatically generates OpenMP parallel code from C program sections.
Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems
TLDR
This paper has comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS) and provides new insights into dynamic behaviors and interaction effects.
UTS: An Unbalanced Tree Search Benchmark
TLDR
An unbalanced tree search benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing, and creates versions of UTS in two parallel languages, OpenMP and Unified Parallel C, using work stealing as the mechanism for reducing load imbalance.
Automatic C-to-CUDA Code Generation for Affine Programs
TLDR
An automatic code transformation system that generates parallel CUDA code from sequential C input for regular (affine) programs, achieving performance quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.
PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System
TLDR
A fully automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, and addresses generation of tiled code for multiple statement domains of arbitrary dimensionalities under (statement-wise) affine transformations.
Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
TLDR
This work proposes an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously and finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation.
Scalable work stealing
TLDR
This work investigates the design and scalability of work stealing on modern distributed memory systems and demonstrates high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
High-performance code generation for stencil computations on GPU architectures
TLDR
This paper develops compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation, and shows that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
On improving the performance of sparse matrix-vector multiplication
TLDR
The data locality characteristics of the compressed sparse row representation is examined and improvements in locality through matrix permutation are considered and modified sparse matrix representations are evaluated.
Distributed job scheduling on computational Grids using multiple simultaneous requests
TLDR
This paper proposes distributed scheduling algorithms that use multiple simultaneous requests at different sites that provide significant performance benefits and shows how this scheme can be adapted to provide priority to local jobs, without much loss of performance.
...