• Corpus ID: 18464112

Fourier Transforms for the BlueGene/L Communication Network

@inproceedings{Jagode2006FourierTF,
  title={Fourier Transforms for the BlueGene/L Communication Network},
  author={Heike Jagode},
  year={2006}
}
A computational kernel of particular importance for many scientific applications is the Fast Fourier Transform (FFT) of multi-dimensional data. A fundamental challenge is to design and implement such parallel numerical algorithms so that they efficiently utilise thousands of nodes. The BlueGene/L is a massively parallel high-performance computer organised as a three-dimensional torus of compute nodes. To maintain application performance and scaling, the correct mapping of MPI tasks onto the…
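To illustrate why task placement matters on a three-dimensional torus, the following minimal sketch counts the network hops between MPI tasks. It is not taken from the paper: the function names, the 8 x 8 x 8 partition size, and the x-fastest rank ordering are assumptions chosen for illustration only.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a torus.

    Each dimension has wraparound links, so the distance along one axis
    is the shorter of the direct path and the path through the wrap link.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def rank_to_coords(rank, dims):
    """Hypothetical rank ordering (an assumption): x varies fastest, then y, then z."""
    dx, dy, _ = dims
    return (rank % dx, (rank // dx) % dy, rank // (dx * dy))


def average_hops(ranks, dims):
    """Mean pairwise hop count within a group of communicating ranks."""
    coords = [rank_to_coords(r, dims) for r in ranks]
    pairs = [(p, q) for i, p in enumerate(coords) for q in coords[i + 1:]]
    return sum(torus_hops(p, q, dims) for p, q in pairs) / len(pairs)


dims = (8, 8, 8)  # assumed partition shape, for illustration only
print(torus_hops((0, 0, 0), (7, 0, 0), dims))  # neighbours via the wrap link
print(average_hops(range(8), dims))            # group lying along one x-line
```

Comparing `average_hops` for the rank groups that take part in each all-to-all step gives a rough indication of how far messages must travel under a given mapping.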

Task placement of parallel multi-dimensional FFTs on a mesh communication network

A simple model for the scope of performance of a large class of mappings on the basis of bandwidth considerations is derived and enables us to identify scaling bottlenecks and hotspots of parallel, communication-intensive 3D-FFT applications when MPI tasks are mapped in the default way onto the network.

Custom Assignment of MPI Ranks for Parallel Multi-dimensional FFTs: Evaluation of BG/P versus BG/L

  • Heike Jagode, J. Hein
  • Computer Science
    2008 IEEE International Symposium on Parallel and Distributed Processing with Applications
  • 2008
This paper investigates the extent of performance improvements for a parallel three-dimensional FFT (3D-FFT) implementation when using customized MPI task mappings and demonstrates that on Blue Gene/P, a carefully chosen MPI task mapping with regard to the network characteristics is more important than on BlueGene/L and yields significant improvement.

Parallel Fourier Transformations using shared memory nodes

This work investigates whether and how the performance of the parallel two-dimensional (2D) FFT can be improved by exploiting the shared memory nodes of HPCx, a cluster of POWER 5 SMP nodes, and by using the hybrid model, a mixed-mode programming model that combines shared memory programming and message passing.

Demanding Parallel FFTs: Slabs & Rods

The results indicate that the FFT libraries installed on the platforms the authors tested are largely comparable in performance, whether vendor-provided or open-source, except on the esoteric Blue Gene/L architecture, where ESSL proved to be superior.

Parallel 3D-FFTs for multi-core nodes on a mesh communication network

This work presents benchmarking results from the UK’s national supercomputing services HECToR and HPCx and explores how the topology of a mesh communication network can affect the communication performance and how nodes offering several processing cores can be exploited to improve the communication performance.

Parallel FFT Libraries

The main conclusions drawn from this study were the ability of the P3DFFT and the 2DECOMP&FFT libraries to scale up to thousands of cores and the effects of the Gemini Interconnect on their performance.

Large-scale FFT on GPU clusters

Three GPU-related factors lead to better performance: firstly the use of GPU devices improves the sustained memory bandwidth for processing large-size data; secondly GPU device memory allows larger subtasks to be processed in whole and hence reduces repeated data transfers between memory and processors; and finally some costly main-memory operations can be significantly sped up by GPUs if necessary data adjustment is performed during data transfers.

3D FFT with 2D decomposition

This project implements a 2D-decomposed 3D FFT based on FFTW, which makes it possible to improve the scaling of the fastest available MD software.
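As a rough illustration of a 2D (pencil) decomposition, the sketch below computes which part of an n x n x n grid each rank owns while the 1D FFTs run along x. The helper names, the row-major placement of ranks on the process grid, and the example sizes are assumptions for illustration, not details of the project summarised above.

```python
def block_bounds(n, parts, index):
    """Half-open range [start, stop) of block `index` when n grid points
    are split into `parts` near-equal contiguous blocks."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def x_pencil(rank, n, prow, pcol):
    """Sub-domain owned by `rank` on a prow x pcol process grid while the
    1D FFTs run along x: the full x extent, with y split across the
    process rows and z split across the process columns.

    Row-major rank placement on the process grid is an assumption.
    """
    r, c = divmod(rank, pcol)
    return (0, n), block_bounds(n, prow, r), block_bounds(n, pcol, c)


# Example: a 64^3 grid on a 4 x 4 process grid (sizes chosen for illustration).
print(x_pencil(5, 64, 4, 4))
```

Because two dimensions are split, up to n * n ranks can hold a non-empty pencil, whereas a 1D (slab) decomposition of the same grid can use at most n ranks.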

Predictive Model for FFT Scalability Performance

This work constructs a model that can be used to assess the scalability of an execution plan and highlights the major factors that directly impact algorithm performance; other platform- and hardware-dependent factors are not included.

Power monitoring with PAPI for extreme scale architectures and dataflow-based programming models

Detailed information is provided about three components that allow power monitoring on the Intel Xeon Phi and Blue Gene/Q and the integration of PAPI in PARSEC - a task-based dataflow-driven execution engine - enabling hardware performance counter and power monitoring at true task granularity.

References

A Volumetric FFT for BlueGene/L

This paper relies on a volume decomposition of the data to take advantage of the toroidal communication topology of BlueGene/L to produce a scalable Fast Fourier Transform (FFT) implementation.

Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer

This paper presents performance characteristics of a communications-intensive kernel, the complex data 3D FFT, running on the Blue Gene/L architecture. Two implementations of the volumetric FFT

Automatically Tuned FFTs for BlueGene/L's Double FPU

This paper presents one of the first numerical kernels run on a prototype BlueGene/L machine, tuning the formal vectorization approach as well as the Vienna MAP vectorizer to support Blue Gene/L's custom two-way short vector SIMD “double” floating-point unit.

Design and implementation of message-passing services for the Blue Gene/L supercomputer

Performance measurements show that message-passing services deliver performance close to the hardware limits of the machine, and dedicating one of the processors of a node to communication functions greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.

A Performance and Scalability Analysis of the BlueGene/L Architecture

A performance and scalability analysis of the architecture from low-level characteristics to large-scale applications and a comparison between the performance of BlueGene/L and the ASCI Q, the largest supercomputer in the US, is presented, based on predictive performance models.

Overview of the Blue Gene/L system architecture

The key architectural features of BlueGene/L are introduced: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.

Large-Scale First-Principles Molecular Dynamics simulations on the BlueGene/L Platform using the Qbox code

Qbox is an FPMD implementation specifically designed for large-scale parallel platforms such as BlueGene/L, and measures of performance by means of hardware counters show that 36% of the peak FPU performance can be attained.

A Student's Guide to Fourier Transforms: With Applications in Physics and Engineering

Preface to the first edition Preface to the second edition 1. Physics and Fourier transforms 2. Useful properties and theorems 3. Applications I: Fraunhofer diffraction 4. Applications II: signal

Computer Architecture: A Quantitative Approach

This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important

Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities
