Corpus ID: 37319603

Getting Up to Speed: The Future of Supercomputing

@inproceedings{Graham2004GETTINGUT,
  title={Getting Up to Speed: The Future of Supercomputing},
  author={Susan L. Graham and Marc Snir and Cynthia A. Patterson},
  year={2004}
}

The Economic Impact of Moore's Law: Evidence from When it Faltered

“Computing performance doubles every couple of years” is the popular rephrasing of Moore’s Law, which describes the 500,000-fold increase in the number of transistors on modern computer chips. But …
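The doubling rule and the 500,000-fold figure are mutually consistent; as a quick sanity check (my arithmetic, not a claim from the cited paper):

```latex
\[
2^{19} = 524{,}288 \approx 5 \times 10^{5},
\qquad
19 \text{ doublings} \times 2\,\tfrac{\text{years}}{\text{doubling}} \approx 38 \text{ years}.
\]
```

Roughly nineteen doublings at one every two years spans about four decades, matching the microprocessor era the law describes.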

Communication optimization in iterative numerical algorithms: an algorithm-architecture interaction

TLDR
This work shows how to select the unroll factor k in an architecture-agnostic manner to trade communication for computation on FPGAs and GPUs, and presents a new algorithm for FPGAs that plays to their strengths by reducing redundant computation, allowing large k and hence higher speedups.

Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1

TLDR
This work generalizes the lower bound approach used initially for Θ(N³) matrix multiplication to a much larger class of algorithms, which may have arbitrary numbers of loops and arrays with arbitrary dimensions, as long as the index expressions are affine combinations of loop variables.
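The Θ(N³) starting point is the classical Hong–Kung result: for conventional N×N matrix multiplication on a machine with a fast memory of M words, the number of words W moved between fast and slow memory satisfies

```latex
\[
W \;=\; \Omega\!\left( \frac{N^{3}}{\sqrt{M}} \right),
\]
```

and the cited work extends bounds of this form to general loop nests whose array subscripts are affine functions of the loop indices.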

Multilevel communication optimal LU and QR factorizations for hierarchical platforms

This study focuses on the performance of two classical dense linear algebra algorithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We first introduce a new model called …

Overlapping clusters for distributed computation

TLDR
This work describes a graph decomposition algorithm for the paradigm in which the partitions intersect, along with a framework for distributed computation across a collection of overlapping clusters, and shows how this framework can be used in various algorithms based on the graph diffusion process.

Some issues in dense linear algebra for multicore and special purpose architectures

We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special purpose architectures (in particular GPUs). We present them in the …

A Threaded Parallel Code for Pricing Discrete Asian Options on SMP Systems

TLDR
Three implementations of a parallel algorithm for pricing discrete Asian options are described: one using the Message Passing Interface (MPI), one using OpenMP, and one using POSIX threads through a high-level Fortran API.

Prospectus for the Next LAPACK and ScaLAPACK Libraries

TLDR
Based on an ongoing user survey and research by many people, the following improvements are proposed: faster algorithms (including better numerical methods), memory-hierarchy optimizations, parallelism, and automatic performance tuning to accommodate new architectures.

Red Storm Capability Computing Queuing Policy.

TLDR
The basic queuing-policy design is described, along with extensions to handle switching between classified and unclassified operation, use by ASC university partners, priority access, etc.

Brief Announcement: On the I/O Complexity of Sequential and Parallel Hybrid Integer Multiplication Algorithms

  • Lorenzo De Stefani
  • Computer Science
    Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures
  • 2022
TLDR
This work presents an Ω((n/max(M, n₀))^{log₂ 3} · (max(1, n₀/M))² · M) lower bound for the I/O complexity of a class of "uniform, non-stationary" hybrid algorithms, where n₀ denotes the threshold size of sub-problems that are computed using standard algorithms with algebraic complexity Ω(n²).
...

References

Co-array Fortran for parallel programming

TLDR
This work introduces the Co-array Fortran extension, gives examples to illustrate how clear, powerful, and flexible it can be, and provides a technical definition.

Global arrays: A nonuniform memory access programming model for high-performance computers

TLDR
The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.

OpenMP: an industry standard API for shared-memory programming

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and, separately, C and C++) to express shared-memory parallelism. It …