Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

@article{Datta2008StencilCO,
  title={Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures},
  author={Kaushik Datta and Mark Murphy and Vasily Volkov and Samuel Williams and Jonathan Carter and Leonid Oliker and David A. Patterson and John Shalf and Katherine A. Yelick},
  journal={2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2008},
  pages={1--12}
}
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations - a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations…
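To make the abstract's two ideas concrete, here is a minimal Python sketch of a nearest-neighbor stencil sweep with loop blocking, plus a brute-force search over block sizes. It assumes a 3D 7-point stencil with made-up weights. The names `make_grid`, `sweep`, `autotune`, `ALPHA`, and `BETA` are hypothetical and do not come from the paper, whose actual kernels and search space are not shown in this excerpt.

```python
import time

ALPHA, BETA = 0.4, 0.1  # illustrative stencil weights (assumed, not from the paper)

def make_grid(n, fill=0.0):
    """Allocate an n x n x n grid as nested lists."""
    return [[[fill] * n for _ in range(n)] for _ in range(n)]

def sweep(src, dst, n, bi, bj):
    """One out-of-place sweep over interior points, blocked in i and j."""
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, n - 1)):
                    for k in range(1, n - 1):
                        # Each point is updated from itself and its six neighbors.
                        dst[i][j][k] = ALPHA * src[i][j][k] + BETA * (
                            src[i - 1][j][k] + src[i + 1][j][k]
                            + src[i][j - 1][k] + src[i][j + 1][k]
                            + src[i][j][k - 1] + src[i][j][k + 1]
                        )

def autotune(n, candidates):
    """Time one sweep per candidate block size and return the fastest."""
    src, dst = make_grid(n, 1.0), make_grid(n)
    best_time, best_block = None, None
    for bi, bj in candidates:
        start = time.perf_counter()
        sweep(src, dst, n, bi, bj)
        elapsed = time.perf_counter() - start
        if best_time is None or elapsed < best_time:
            best_time, best_block = elapsed, (bi, bj)
    return best_block
```

A call such as `autotune(32, [(4, 4), (8, 8), (30, 30)])` returns whichever block size ran fastest on the current machine. In pure Python the timing differences are mostly noise, so the sketch shows only the shape of the search, not the cache effects a compiled kernel would expose.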

Extracted Numerical Results

  • Each core is based on Intel’s Core2 microarchitecture, runs at 2.66 GHz, can fetch and decode four instructions per cycle, execute 6 micro-ops per cycle, and fully support 128b SSE, for peak double-precision performance of 10.66 GFlop/s per core.
  • … SSE instructions, for peak double-precision performance of 9.2 GFlop/s per core or 36.8 GFlop/s per socket.
  • The enhanced SPEs can now execute two double precision FMAs per cycle, for a peak of 12.8 GFlop/s per SPE. The QS22 blade used in this study is comprised of two sockets with eight SPEs each (204.8 GFlop/s double-precision peak).
  • However, our use of SDK 2.1 resulted in poor double precision code scheduling as the compiler was scheduling for a QS20 rather than a QS22.
  • However, the scatter plot suggests the code is achieving nearly 100% of this algorithm’s double precision peak flop rate while consuming better than 66% of its memory bandwidth.
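
The peak figures quoted in the bullets above follow from clock rate times flops per cycle times number of units. The short Python check below works through that arithmetic under two assumptions that do not appear in the excerpt: 4 double-precision flops per cycle for the Core2 core (one 2-wide SSE add plus one 2-wide SSE multiply), and a 3.2 GHz SPE clock. `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(clock_ghz, flops_per_cycle, units=1):
    """Peak GFlop/s = clock (GHz) x flops per cycle x number of execution units."""
    return clock_ghz * flops_per_cycle * units

# Core2 core at 2.66 GHz; 4 DP flops per cycle is an assumption.
# This gives about 10.64; the quoted 10.66 suggests an unrounded clock near 2.667 GHz.
core2_core = peak_gflops(2.66, 4)

# Cell SPE: two FMAs per cycle count as 4 flops; the 3.2 GHz clock is an assumption.
spe = peak_gflops(3.2, 4)

# QS22 blade: two sockets with eight SPEs each.
blade = peak_gflops(3.2, 4, units=2 * 8)

# Second bullet: socket peak divided by core peak gives the implied cores per socket.
cores_per_socket = 36.8 / 9.2
```

Under these assumptions, `spe` and `blade` match the quoted 12.8 and 204.8 GFlop/s, and the second bullet implies four cores per socket.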

Statistics

[Chart: Citations per Year, 2008 to 2018]

Semantic Scholar estimates that this publication has 617 citations based on the available data.
