Corpus ID: 13090115

A Performance Model for the Communication in Fast Multipole Methods on HPC Platforms

Huda Ibeid, Rio Yokota, David E. Keyes
Exascale systems are predicted to have approximately one billion cores, assuming gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the current parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. There is therefore an urgent need to model application performance and to understand what changes need to be made to ensure…
Scalable Fast Multipole Method for Electromagnetic Simulations
The first results of an ML-FMM algorithm implementation using GASPI asynchronous one-sided communications to improve code scalability and performance are presented, showing an 83.5% reduction in communication costs over the optimized MPI+OpenMP version.
Parallel Implementation of the Fast Multipole Method (Parallele Implementierung der Fast Multipole Methode)
In this thesis we develop an MPI parallelization for the Fast Multipole Method in the Molecular Dynamics software MarDyn. Different optimizations to the implementation were investigated to minimize…
Communication Complexity of the Fast Multipole Method and its Algebraic Variants
This work describes implementation aspects of a hybrid of these two compelling hierarchical algorithms on hierarchical distributed-shared memory architectures, which are likely to be the first to reach the exascale, and presents a new communication complexity estimate for fast multipole methods on such architectures.


Modeling the performance of an algebraic multigrid cycle on HPC platforms
This paper considers algebraic multigrid (AMG), a popular and highly efficient iterative solver for large sparse linear systems that is used in many applications, and presents a performance model for an AMG solve cycle and performance measurements on several massively-parallel platforms.
A massively parallel adaptive fast-multipole method on heterogeneous architectures
New scalable algorithms and a new implementation of the kernel-independent fast multipole method are presented, in which both distributed memory parallelism and shared memory/streaming parallelism are employed to rapidly evaluate two-body non-oscillatory potentials.
Introduction to the HPCChallenge Benchmark Suite
The HPCChallenge suite of benchmarks will examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack…
Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems, and shows that optimization and parallelization can improve double-precision performance by 25× on Intel's quad-core Nehalem, 9.4× on AMD's quad-core Barcelona, and 37.6× on Sun's Victoria Falls.
Efficient parallel implementations of multipole based n-body algorithms
This dissertation presents a novel O(N2) method for the computation of an approximate pressure tensor, which uses multipole expansions and hierarchical decomposition to produce results with a known error bound, while allowing integration with existing multipole-based N-body solvers.
Towards Realistic Performance Bounds for Implicit CFD Codes
The chapter illustrates the performance limitations caused by insufficient available memory bandwidth with a discussion of sparse matrix-vector multiply, a critical operation in many iterative methods used in implicit CFD codes, and focuses on the per-processor performance of compute nodes used in parallel computers.
Performance scalability prediction on multicomputers
By integrating compilation, performance analysis and symbolic manipulation tools, it is possible to correctly predict, in an automated fashion, the major performance variations of a data parallel program written in a high-level language.
Symbolic performance prediction of scalable parallel programs
  • M. Clement, M. J. Quinn
  • Computer Science
  • Proceedings of 9th International Parallel Processing Symposium
  • 1995
This research develops a performance prediction methodology that addresses this problem through symbolic analysis of program source code to determine performance for scaled up applications on different hardware architectures.
Scaling Hierarchical N-body Simulations on GPU Clusters
Key performance issues in the context of clusters of GPUs are investigated, including kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks.
Integrated compilation and scalability analysis for parallel systems
  • C. Mendes, D. Reed
  • Computer Science
  • Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192)
  • 1998
A new methodology to automatically predict the performance scalability of data parallel applications on multicomputers is presented, which represents the execution time of a program as a symbolic expression that includes the number of processors, problem size, and other system-dependent parameters.