John D. McCalpin

HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates performance optimization. The readily available performance evaluation tools require considerable effort to learn and utilize. Hence, most HPC application writers do not use them.
This paper considers the modifications required to transform a highly-efficient, specialized linear algebra core into an efficient engine for computing Fast Fourier Transforms (FFTs). We review the minimal changes required to support Radix-4 FFT computations and propose extensions to the micro-architecture of the baseline linear algebra core. Along the way, ...
Coarse-grained multithreading, the switching of threads to avoid idle processor time during long-latency events, has been available on IBM systems since 1998. Simultaneous multithreading (SMT), first available on the POWER5 processor, moves beyond simple thread switching to the maintenance of two thread streams that are issued as continuously as possible ...
The computation nodes of modern supercomputers commonly consist of multiple multicore processors. Maximizing the performance of such systems requires measurement, analysis, and optimization techniques that specifically target multicore environments. This paper first examines traditional unicore metrics and demonstrates how they can be misleading in a ...
Scalable cache-coherent nonuniform memory access (ccNUMA) architectures are an important design segment for high-performance scalable multiprocessor systems. In order to write application programs that take advantage of such systems, or port application programs written for symmetric multiprocessor systems with uniform memory access times, it is important ...
The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, effectively using only a fraction of the number of cores on each chip or achieving suboptimal performance from the cores they do ...
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT ...
For the last decade, HPC systems have been dominated by clusters of two-socket commodity x86 servers, typically equipped with a non-commodity high-performance interconnect. Trends in lifecycle costs and prices, hardware technology, several measures of CPU and memory performance, and application performance characteristics are presented using several ...