Corpus ID: 26189366

Efficiency improvements of iterative numerical algorithms on modern architectures

Jan Treibig
For many numerical codes, the transport of data from main memory to the registers is commonly considered the main limiting factor in achieving high performance on present microarchitectures; this is referred to as the memory wall. A lot of research targets this point on different levels, covering, for example, code transformations and architecture-aware data structures to achieve optimal usage of the memory hierarchy found in all present microarchitectures. This work shows… 

Modeling the Performance of Geometric Multigrid Stencils on Multicore Computer Architectures

This work suggests the use of state-of-the-art (stencil) compiler techniques to improve the flop-per-byte ratio, also called the arithmetic intensity, of the steps in the algorithm, focusing on the smoother, which is a repeated stencil application.
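The flop-per-byte accounting behind this metric can be illustrated with a back-of-the-envelope estimate. The counts below are illustrative assumptions for a generic constant-coefficient 7-point Jacobi smoother in double precision, not figures from the paper:

```python
# Hypothetical per-point operation and traffic counts for a constant-coefficient
# 7-point Jacobi update (illustrative assumptions, not data from the paper).
flops_per_point = 8      # 6 additions + 2 multiplications (stencil weight, relaxation)
bytes_per_point = 2 * 8  # one 8-byte load + one 8-byte store, assuming the six
                         # neighbour values are already cache-resident

arithmetic_intensity = flops_per_point / bytes_per_point
print(f"arithmetic intensity: {arithmetic_intensity} flops/byte")  # 0.5
```

At roughly 0.5 flops/byte such a kernel is far below the machine balance of current processors, which is why the transformations discussed here aim to raise the ratio by reusing each loaded byte across several stencil applications.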

Compiler generation and autotuning of communication-avoiding operators for geometric multigrid

A compiler approach to introducing communication-avoiding optimizations in geometric multigrid (GMG), one of the most popular methods for solving partial differential equations, is described; it is able to quadruple the performance of the smooth operation on the finest grids while attaining performance within 94% of manually tuned code.

Performance Engineering of Numerical Software on Multi- and Manycore Processors

This thesis contributes to this field from the perspective of a computer scientist who is involved in scientific computing by providing the necessary basis for an understanding of the development process of numerical software and of the emergence of performance at the hardware/software interface.

Compiler optimizations and autotuning for stencils and geometric multigrid

This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils, and demonstrates the efficacy of the approach by comparing the performance of generated code against manually tuned code, over commercial compiler-generated code, and against analytic performance bounds.
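An idealized data-movement model shows why communication-avoiding transformations pay off for memory-bound stencils. The perfect-reuse assumption below is a deliberate simplification for illustration (ghost-zone overhead of real tilings is ignored):

```python
def dram_traffic_bytes(n_points, n_sweeps, temporally_blocked):
    """Idealized DRAM traffic for n_sweeps of a double-precision stencil
    over n_points grid points (ghost-zone redundancy ignored)."""
    per_pass = 2 * 8 * n_points  # read + write each 8-byte value once per pass
    if temporally_blocked:
        # All sweeps are applied while a tile is cache-resident,
        # so the grid streams through main memory only once.
        return per_pass
    return n_sweeps * per_pass

naive = dram_traffic_bytes(10**6, 4, temporally_blocked=False)
blocked = dram_traffic_bytes(10**6, 4, temporally_blocked=True)
print(naive // blocked)  # 4: the sweep count bounds the attainable speedup
```

For a bandwidth-bound kernel this traffic ratio is an upper bound on the speedup; in practice ghost-zone recomputation and finite cache capacity reduce the gain, which is why the papers above compare against analytic performance bounds.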

Optimization of geometric multigrid for emerging multi- and manycore processors

This paper explores optimization techniques for geometric multigrid on existing and emerging multicore systems, including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based InfiniBand clusters, as well as the new Intel® Xeon Phi coprocessor (Knights Corner).

Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

miniGMG, a compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications, is described, and a variety of novel techniques are examined, including communication aggregation, a threaded wavefront-based DRAM communication-avoiding scheme, dynamic threading decisions, SIMDization, and fusion of operators.

Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers

The design goal of WaLBerla is to be usable, maintainable, and extendable as well as to enable efficient and scalable implementations on massively parallel supercomputers and it is shown that the design goals have been fulfilled.

Communication-Avoiding Optimization of Geometric Multigrid on GPUs

This report explores communication-avoiding implementations of geometric multigrid on NVIDIA GPUs and achieves an overall gain of 1.2x for the whole multigrid algorithm over the baseline implementation.



Data locality optimizations for iterative numerical algorithms and cellular automata on hierarchical memory architectures

This thesis presents approaches towards the optimization of the data locality of implementations of grid-based numerical algorithms, focusing on multigrid methods based on structured meshes as well as cellular automata in both 2D and 3D.
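One standard locality transformation in this family is loop tiling (blocking). The tiled transpose below is a generic textbook sketch, not code from the thesis; the tile size `b` is a tunable assumption that would be chosen to fit two tiles in cache:

```python
def transpose_blocked(a, b=32):
    """Transpose the square matrix a (list of lists) in b x b tiles so that
    the touched rows of both source and destination stay cache-resident
    while a tile is processed (generic sketch; b is a tuning parameter)."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            # process one b x b tile
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    out[j][i] = a[i][j]
    return out

a = [[i * 5 + j for j in range(5)] for i in range(5)]
assert transpose_blocked(a, b=2) == [list(r) for r in zip(*a)]
```

Without tiling, either the reads or the writes stride through memory and evict each cache line after a single use; the tile loops restructure the traversal so every loaded line is fully consumed, which is the essence of the data-locality optimizations the thesis develops for stencil and multigrid codes.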

Optimizing Compilers for Modern Architectures: A Dependence-based Approach

A broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory-hierarchy management, and instruction scheduling is provided.

Loop Optimization using Hierarchical Compilation and Kernel Decomposition

This work proposes a new hierarchical compilation approach for generating high-performance code with state-of-the-art compilers; it relies on decomposing the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize.

Efficient Utilization of SIMD Extensions

Special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions is described; it leads to substantial speedups over the best scalar C codes generated by the original systems and roughly matches the performance of hand-tuned vendor libraries.

An Experimental Study of Self-Optimizing Dense Linear Algebra Software

This paper addresses these questions for matrix multiplication, the most important dense linear algebra kernel.

Optimistic parallelism benefits from data partitioning

Recent studies of irregular applications such as finite-element mesh generators and data-clustering codes have shown that these applications have a generalized data parallelism arising from the use…

Optimizing performance on modern HPC systems: learning from simple kernel benchmarks

The need for a carefully crafted OpenMP implementation, even for simple benchmark programs, is stressed in order to exploit the high aggregate memory bandwidth available nowadays on ccNUMA systems.

An efficient memory operations optimization technique for vector loops on Itanium 2 processors

It is demonstrated that, if no care is taken at compile time, the non-precise memory disambiguation mechanism and the banking structure cause severe performance loss, even for very simple regular codes.

On the Scalability of an Automatically Parallelized Irregular Application

This paper studies the performance and scalability of a Galoised, that is, automatically parallelized, version of Delaunay mesh refinement (DR) on a shared-memory system with 128 CPUs and finds the Galois approach to be very promising.

Think globally, search locally

This paper advocates a methodology for generating high-performance code without dramatically increasing search time, and demonstrates this methodology by using it to eliminate the performance gap for code produced by a model-driven version of ATLAS described in prior work.