A practical automatic polyhedral parallelizer and locality optimizer
An automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, implemented in a tool that automatically generates OpenMP parallel code from C program sections.
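As a rough illustration of the kind of code such a framework aims to produce, the sketch below is a hand-written tiled matrix multiplication whose outermost tile loop is marked parallel with OpenMP. It is not output from the tool. The function name `matmul_tiled` and the tile size `T` are hypothetical, and a polyhedral optimizer would derive the tiling and the parallel loop from dependence analysis rather than by hand.

```c
#include <assert.h>
#include <stddef.h>

#define T 32 /* hypothetical tile size; a real tool would select or tune this */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, tiled for cache locality.
 * Iterations of the ii loop write disjoint rows of C, so that loop can
 * run in parallel. Compile with -fopenmp to use threads; otherwise the
 * pragma is ignored and the code runs serially with the same result. */
void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    assert(C != A && C != B); /* output must not alias the inputs */
    #pragma omp parallel for
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                /* Point loops: one T x T x T block of the computation. */
                for (size_t i = ii; i < min_sz(ii + T, n); i++)
                    for (size_t k = kk; k < min_sz(kk + T, n); k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + T, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Tiling all three loops lets blocks of `A`, `B`, and `C` be reused while they are still in cache, and the `omp parallel for` on `ii` is legal because each tile row updates a disjoint set of rows of `C`.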
Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems
- Jiang Lin, Q. Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, P. Sadayappan
- Computer Science · IEEE 14th International Symposium on High…
- 24 October 2008
This paper comprehensively evaluates several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS), and provides new insights into their dynamic behaviors and interaction effects.
UTS: An Unbalanced Tree Search Benchmark
An unbalanced tree search benchmark designed to evaluate the performance and ease of programming of parallel applications that require dynamic load balancing. The work creates versions of UTS in two parallel languages, OpenMP and Unified Parallel C, using work stealing as the mechanism for reducing load imbalance.
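The key property of UTS is that each node's number of children is derived deterministically from the node's own descriptor, so the tree is reproducible but its shape cannot be known without exploring it. The sketch below counts the nodes of a binomial-style tree sequentially with an explicit stack. It assumes a SplitMix64-style hash as a stand-in for the SHA-1-based splittable random number generator that UTS uses, and the names `uts_count`, `child_id`, `b0`, `m`, and `q` are illustrative rather than taken from the benchmark's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* SplitMix64 finalizer, used here as a hypothetical stand-in for the
 * SHA-1-based splittable RNG of UTS. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* A child's descriptor depends only on its parent's descriptor and its
 * index, so any worker can expand any node without shared state. */
static uint64_t child_id(uint64_t parent, uint64_t i)
{
    return mix64(parent ^ mix64(i + 1));
}

static void push(uint64_t **stack, size_t *top, size_t *cap, uint64_t v)
{
    if (*top == *cap) {
        uint64_t *grown = realloc(*stack, 2 * *cap * sizeof **stack);
        if (grown == NULL) abort();
        *stack = grown;
        *cap *= 2;
    }
    (*stack)[(*top)++] = v;
}

/* Binomial-style tree: the root has b0 children; every other node has m
 * children with probability q and none otherwise. Returns the node count
 * using a sequential depth-first traversal with an explicit stack. */
uint64_t uts_count(uint64_t seed, unsigned b0, unsigned m, double q)
{
    assert(q >= 0.0 && q <= 1.0);
    size_t cap = 1024, top = 0;
    uint64_t *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) abort();
    uint64_t count = 1; /* the root */
    for (unsigned i = 0; i < b0; i++)
        push(&stack, &top, &cap, child_id(seed, i));
    while (top > 0) {
        uint64_t node = stack[--top];
        count++;
        /* Map the node's hash to [0, 1) to decide whether it branches. */
        double u = (double)(mix64(node) >> 11) * 0x1.0p-53;
        if (u < q)
            for (unsigned i = 0; i < m; i++)
                push(&stack, &top, &cap, child_id(node, i));
    }
    free(stack);
    return count;
}
```

With `q * m < 1` the expected tree size is finite, but it varies widely from subtree to subtree, which is what defeats static partitioning. A parallel version would let idle workers take nodes from other workers' stacks.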
Automatic C-to-CUDA Code Generation for Affine Programs
An automatic code transformation system that generates parallel CUDA code from sequential C input for regular (affine) programs. The generated code's performance is quite close to that of hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.
PLuTo: A Practical and Fully Automatic Polyhedral Program Optimization System
A fully automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously. It also addresses the generation of tiled code for multiple statement domains of arbitrary dimensionality under (statement-wise) affine transformations.
Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
- Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan
- Computer Science · CC
- 29 March 2008
This work proposes an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously and finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation.
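For the special case of constant (uniform) dependence distance vectors, the validity condition and the cost being minimized reduce to dot products. A hyperplane h is a legal tiling hyperplane if h · d ≥ 0 for every dependence d, and h · d measures how many hyperplane instances a dependence spans, a proxy for communication volume and reuse distance. The sketch below replaces the ILP with brute-force enumeration of small integer coefficients in a 2D iteration space, so it is a simplified stand-in rather than the paper's formulation. `Vec2`, `cost`, and `best_hyperplane` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int a, b; } Vec2; /* vector in a 2D iteration space */

static int dot(Vec2 h, Vec2 d) { return h.a * d.a + h.b * d.b; }

/* Returns -1 if h violates h . d >= 0 for some dependence d (illegal for
 * tiling); otherwise returns max over d of h . d, the quantity minimized. */
static int cost(Vec2 h, const Vec2 *deps, size_t nd)
{
    int worst = 0;
    for (size_t i = 0; i < nd; i++) {
        int c = dot(h, deps[i]);
        if (c < 0) return -1;
        if (c > worst) worst = c;
    }
    return worst;
}

/* Brute-force stand-in for the ILP: tries every h with coefficients in
 * [-R, R], skipping h = 0 and, if prev is non-NULL, any h linearly
 * dependent on *prev. Stores the cheapest legal h in *out and returns its
 * cost, or returns -1 if none is legal. */
int best_hyperplane(const Vec2 *deps, size_t nd, const Vec2 *prev, int R, Vec2 *out)
{
    assert(R >= 0 && out != NULL);
    int best = -1;
    for (int a = -R; a <= R; a++)
        for (int b = -R; b <= R; b++) {
            Vec2 h = {a, b};
            if (a == 0 && b == 0) continue;
            if (prev != NULL && a * prev->b - b * prev->a == 0) continue;
            int c = cost(h, deps, nd);
            if (c >= 0 && (best < 0 || c < best)) {
                best = c;
                *out = h;
            }
        }
    return best;
}
```

For a 1D Jacobi stencil in (t, i) space, the dependences are (1, -1), (1, 0), and (1, 1). The search finds (1, 0) with cost 1 and, once linear independence from it is required, a second hyperplane (1, ±1) with cost 2.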
Scalable work stealing
- James Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, J. Nieplocha
- Computer Science · Proceedings of the Conference on High Performance…
- 14 November 2009
This work investigates the design and scalability of work stealing on modern distributed memory systems and demonstrates high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
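Most work-stealing schedulers share a per-worker double-ended queue discipline. The owner pushes and pops at one end (LIFO, which favors locality), while idle workers steal from the other end (FIFO, which tends to take older and, in recursive workloads, larger tasks). The sketch below simulates that discipline on a single thread with cyclic rather than random victim selection. It is not the paper's distributed-memory implementation, and `Deque`, `ws_simulate`, `CAP`, `MAXW`, and the task model (each task of depth t > 0 spawns two tasks of depth t - 1) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 256 /* hypothetical per-worker deque capacity */
#define MAXW 8  /* hypothetical maximum number of simulated workers */

/* Ring-buffer deque: the owner works at the tail, thieves at the head. */
typedef struct {
    int buf[CAP];
    size_t head, count;
} Deque;

static void dq_push(Deque *d, int t)
{
    assert(d->count < CAP);
    d->buf[(d->head + d->count) % CAP] = t;
    d->count++;
}

/* Owner: pop the most recently pushed task (LIFO). */
static bool dq_pop(Deque *d, int *t)
{
    if (d->count == 0) return false;
    d->count--;
    *t = d->buf[(d->head + d->count) % CAP];
    return true;
}

/* Thief: take the oldest task (FIFO). */
static bool dq_steal(Deque *d, int *t)
{
    if (d->count == 0) return false;
    *t = d->buf[d->head];
    d->head = (d->head + 1) % CAP;
    d->count--;
    return true;
}

/* Round-robin simulation of P workers. All work starts on worker 0 and
 * idle workers steal. done[w] receives the tasks worker w executed. */
size_t ws_simulate(int depth, size_t P, size_t done[])
{
    assert(P >= 1 && P <= MAXW && depth >= 0);
    Deque dq[MAXW] = {0};
    size_t total = 0;
    for (size_t w = 0; w < P; w++) done[w] = 0;
    dq_push(&dq[0], depth);
    bool progress = true;
    while (progress) { /* stops once a full round finds every deque empty */
        progress = false;
        for (size_t w = 0; w < P; w++) {
            int t;
            bool got = dq_pop(&dq[w], &t);
            /* Cyclic victim order; real schedulers usually pick randomly. */
            for (size_t k = 1; !got && k < P; k++)
                got = dq_steal(&dq[(w + k) % P], &t);
            if (!got) continue;
            progress = true;
            done[w]++;
            total++;
            if (t > 0) {
                dq_push(&dq[w], t - 1);
                dq_push(&dq[w], t - 1);
            }
        }
    }
    return total;
}
```

A depth-d task tree has 2^(d+1) - 1 tasks, so `ws_simulate(10, 4, done)` executes 2047 tasks in total, spread across the four workers by stealing.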
High-performance code generation for stencil computations on GPU architectures
This paper develops compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation, and shows that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
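A common way to time-tile stencils for GPUs is overlapped tiling. Each spatial tile loads a halo as wide as the number of fused time steps, redundantly recomputes that halo, and writes back only its own cells, so tiles need no synchronization within a time block. The sketch below shows this for a 1D 3-point Jacobi stencil on the CPU, next to a plain reference. It is offered as a generic illustration of time tiling, not as the paper's algorithm. The names `jacobi_ref`, `jacobi_overlapped`, `tile`, and `T` are hypothetical, and on a GPU each tile would map to a thread block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reference: `steps` sweeps of a 3-point Jacobi stencil on a[0..n), with
 * a[0] and a[n - 1] held fixed. The result overwrites a. */
void jacobi_ref(double *a, size_t n, int steps)
{
    assert(n >= 3);
    double *b = malloc(n * sizeof *b);
    if (b == NULL) abort();
    for (int t = 0; t < steps; t++) {
        b[0] = a[0];
        b[n - 1] = a[n - 1];
        for (size_t i = 1; i + 1 < n; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        memcpy(a, b, n * sizeof *a);
    }
    free(b);
}

/* Overlapped time tiling: each tile [lo, hi) copies [lo - tb, hi + tb)
 * (clipped to the array), advances tb steps locally, and writes back only
 * [lo, hi). Errors from stale halo edges move inward one cell per step,
 * so after tb steps they have not reached the tile. Tiles within a time
 * block are independent of each other. */
void jacobi_overlapped(double *a, size_t n, int steps, size_t tile, int T)
{
    assert(n >= 3 && tile >= 1 && T >= 1);
    size_t bufsz = tile + 2 * (size_t)T;
    double *out = malloc(n * sizeof *out);
    double *cur = malloc(bufsz * sizeof *cur);
    double *nxt = malloc(bufsz * sizeof *nxt);
    if (out == NULL || cur == NULL || nxt == NULL) abort();
    for (int t0 = 0; t0 < steps; t0 += T) {
        size_t tb = (size_t)(steps - t0 < T ? steps - t0 : T); /* steps in this block */
        for (size_t lo = 0; lo < n; lo += tile) {
            size_t hi = lo + tile < n ? lo + tile : n;
            size_t s = lo >= tb ? lo - tb : 0;
            size_t e = hi + tb < n ? hi + tb : n;
            size_t len = e - s;
            memcpy(cur, a + s, len * sizeof *cur);
            for (size_t k = 0; k < tb; k++) {
                /* Edge cells: a true boundary (fixed) or a stale halo cell. */
                nxt[0] = cur[0];
                nxt[len - 1] = cur[len - 1];
                for (size_t j = 1; j + 1 < len; j++)
                    nxt[j] = (cur[j - 1] + cur[j] + cur[j + 1]) / 3.0;
                double *tmp = cur; cur = nxt; nxt = tmp;
            }
            memcpy(out + lo, cur + (lo - s), (hi - lo) * sizeof *out);
        }
        memcpy(a, out, n * sizeof *a);
    }
    free(out);
    free(cur);
    free(nxt);
}
```

Because every point written back is computed by the same expression in the same order, `jacobi_overlapped` reproduces `jacobi_ref` exactly. The cost is redundant halo work of roughly 2T extra cells per tile per time step.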
On improving the performance of sparse matrix-vector multiplication
- J. White, P. Sadayappan
- Computer Science · Proceedings Fourth International Conference on…
- 18 December 1997
The data locality characteristics of the compressed sparse row (CSR) representation are examined, improvements in locality through matrix permutation are considered, and modified sparse matrix representations are evaluated.
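In CSR, the nonzeros of each row are stored contiguously together with their column indices, and a row-pointer array marks where each row starts. The minimal SpMV kernel below is a sketch (`CsrMatrix` and `csr_spmv` are hypothetical names) that makes the locality issue visible: `val` and `col_idx` are streamed sequentially, but `x[col_idx[k]]` is an indirect access whose cache behavior depends on how the matrix rows and columns are ordered.

```c
#include <assert.h>
#include <stddef.h>

/* Compressed sparse row: row_ptr has nrows + 1 entries, and the nonzeros
 * of row i are val[row_ptr[i] .. row_ptr[i + 1]) with matching column
 * indices in col_idx. */
typedef struct {
    size_t nrows;
    const size_t *row_ptr;
    const size_t *col_idx;
    const double *val;
} CsrMatrix;

/* y = A * x. val and col_idx are read sequentially; x is read through
 * col_idx, which is where ordering affects cache locality. */
void csr_spmv(const CsrMatrix *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}
```

A symmetric permutation of the matrix (for example, a bandwidth-reducing ordering such as reverse Cuthill-McKee) leaves this kernel unchanged but changes which entries of `x` are touched close together in time.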
Distributed job scheduling on computational Grids using multiple simultaneous requests
- Vijay Subramani, R. Kettimuthu, Srividya Srinivasan, P. Sadayappan
- Computer Science · Proceedings 11th IEEE International Symposium on…
- 24 July 2002
This paper proposes distributed scheduling algorithms that use multiple simultaneous requests at different sites, shows that they provide significant performance benefits, and shows how the scheme can be adapted to give priority to local jobs without much loss of performance.
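The idea of multiple simultaneous requests can be sketched as follows: a copy of the job is queued at several sites, the job runs wherever it starts first, and the remaining copies are cancelled. The function below is a simplified model under stated assumptions. Each site has a predicted wait, which is used to choose the k candidate sites, and an actual wait, which decides the winner. `multi_request`, `MAX_SITES`, and the smallest-predicted-wait selection rule are hypothetical and not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SITES 64 /* hypothetical upper bound on the number of sites */

/* Chooses the k sites with the smallest predicted queue wait, "submits" a
 * copy of the job to each, and returns the site where the job actually
 * starts first; copies at the other chosen sites would then be cancelled.
 * *start receives the winning start time. Ties keep the lower index. */
size_t multi_request(const double *predicted, const double *actual,
                     size_t nsites, size_t k, double *start)
{
    assert(nsites >= 1 && nsites <= MAX_SITES && k >= 1 && k <= nsites);
    int chosen[MAX_SITES] = {0};
    /* Select k sites by repeated minimum search: O(k * nsites). */
    for (size_t r = 0; r < k; r++) {
        size_t best = nsites;
        for (size_t s = 0; s < nsites; s++)
            if (!chosen[s] && (best == nsites || predicted[s] < predicted[best]))
                best = s;
        chosen[best] = 1;
    }
    /* The job runs at whichever chosen site starts it first. */
    size_t winner = nsites;
    for (size_t s = 0; s < nsites; s++)
        if (chosen[s] && (winner == nsites || actual[s] < actual[winner]))
            winner = s;
    *start = actual[winner];
    return winner;
}
```

This model ignores the extra load that redundant requests place on remote queues, which a real evaluation would have to account for.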