In 2003, DARPA's High Productivity Computing Systems (HPCS) program released the HPCC suite. It examines the performance of HPC architectures using well-known computational kernels with various memory access patterns. Consequently, HPCC results bound the performance of real applications as a function of memory access characteristics and define performance...
The historical context surrounding the birth of the DARPA High Productivity Computing Systems (HPCS) program is important for understanding why federal government agencies launched this new, long-term high performance computing program and renewed their commitment to leadership computing in support of national security, large science, and space requirements...
This paper presents a new distributed multifrontal sparse matrix decomposition algorithm suitable for message passing parallel processors. The algorithm uses a nested dissection ordering and a multifrontal distribution of the matrix to minimize interprocessor data dependencies and overcome the communication bottleneck previously reported for sparse matrix...
The use of GPUs to accelerate the factoring of large sparse symmetric indefinite matrices has the potential to yield important benefits to a large group of widely used applications. This paper examines how a multifrontal sparse solver performs when exploiting both the GPU and its multi-core host. It demonstrates that the GPU can dramatically accelerate...
In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Due to the growing complexity of accurate performance prediction, our system incorporates empirical techniques to execute variants of code segments with...
This paper reports on the results of a workshop on programming models, languages, compilers and runtime systems for exascale machines. The goal was to identify some of the challenges in each of these areas, the promising approaches that should be pursued, and measures to assess progress. The challenges derived from more complex systems with additional...
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality...
In just one decade, the 1990s, supercomputer centers have undergone two fundamental transitions that require rethinking their operation and their role in high performance computing. The first transition, in the early to mid-1990s, resulted from a technology change in high performance computing architecture. Highly parallel distributed memory machines built...