Sriram Krishnamoorthy

Learn More
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed(More)
This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a a high-level specification of the computation, from which the system can synthesize high-performance parallel(More)
GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on their GPUs. However, manual development of high-performance parallel code for GPUs is still very challenging. In this paper, a(More)
The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling(More)
Solving large, irregular graph problems efficiently is challenging. Current software systems and commodity multiprocessors do not support fine-grained, irregular parallelism well. We present XWS, the X10 Work Stealing framework, an open-source runtime for the parallel programming language X10 and a library to be used directly by application writers. XWS(More)
We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an(More)
Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes for optimization of data locality and parallelism. However, loop skewing is typically(More)
Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the(More)
The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular, and unbalanced workload. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently,(More)
—With increasing numbers of cores, future CMPs (Chip Multi-Processors) are likely to have a tiled architecture with a portion of shared L2 cache on each tile and a bank-interleaved distribution of the address space. Although such an organization is effective for avoiding access hot-spots, it can cause a significant number of non-local L2 accesses for many(More)